High speed and low power arithmetic design using limited switching dynamic logic by Nguyen, Cong
Lehigh University
Lehigh Preserve
Theses and Dissertations
2004
Recommended Citation
Nguyen, Cong, "High speed and low power arithmetic design using limited switching dynamic logic" (2004). Theses and Dissertations.
Paper 836.
High speed and low power arithmetic
design using Limited Switching Dynamic Logic
by
CONG NGUYEN
A Thesis
Presented to the Graduate and Research Committee
of
Lehigh University
In Candidacy for the degree of Master of Science
in
The department of Electrical and Computer Engineering
at
Lehigh University
May 2004

ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to my manager, Mr. Marc Lang, for
his understanding and support: for his encouragement, for arranging the paperwork
needed to obtain the financial grant from IBM for my course of study and for this
research, and for making the resources available for me to work on this project. I am
also sincerely grateful to Dr. Marvin White, who advised me, inspired me to get started
on this thesis, and offered invaluable suggestions and ideas for this project. Without
their support, this project would not have been possible.
Table of Contents
Acknowledgement
List of Figures
Abstract
1. Introduction
1.1 Motivation
1.2 Thesis overview
2. Background and Conventional Dynamic Circuit
2.1 Dynamic and Domino circuits
2.2 Bulk CMOS vs. SOI technology: Partially and Fully Depleted SOI
2.3 Source of power consumption in static and dynamic circuits
3. Computer Arithmetic Review
3.1 CLA Addition Theory
3.2 Adder Architectural descriptions
CLA: Second level of Abstraction
4. Limited Switching Dynamic style (LSD)
4.1 LSD circuit description and analysis
4.2 LSD Clocking and feedback Keeper functionality
4.3 Transistor sizing for Dynamic and Static circuits
5. Adder Design and Implementation
5.1 64-bit and Static CLA adder designs
5.2 Physical design tools and CLA adder layouts
5.3 IBM's CMOS8S0 SOI technology
5.4 Design checking in physical design
6. Adder Evaluations
6.1 Static adder performance evaluations
6.2 LSD adder performance evaluations
7. Conclusions and future works
Appendixes
A. Static schematics and layouts
B. LSD schematics and layouts
C. SOI Partially Depleted Simplified Equations
D. Power Calculation Panels
References
Vitae
List of Figures
Figure 2.1 Dynamic circuit configuration
Figure 2.2 Domino circuit configuration
Figure 3.1 4-bit carry block
Figure 3.2 16-bit CLA
Figure 3.3 64-bit CLA
Figure 4.1 LSD clock generation scheme
Figure 4.2 LSD AOI circuit configuration
Figure 4.3 LSD inverter configuration
Figure 4.4 LSD inverter with clk=0 and regardless of input A, Z=1
Figure 4.5 LSD inverter configuration, clk=1 and A=1, Z → 0
Figure 4.6 LSD inverter with clk=1, A=1 and Z → 0
Figure 4.7 Dyn/Dom and LSD test circuits
Figure 4.8 LSD's inverter wave-form outputs
Figure 4.9 Sizing Static inverter
Figure 4.10 Sizing LSD circuit (AOI)
Figure 5.1 Flowchart for Virtuoso XL
Figure 5.2 CMOS physical design methodology flow
List of Tables
Table 2.1 IBM's activity coefficients
Table 4.1 IBM CMOS design parameters
Table 4.2 IBM CMOS design guideline for stacked devices
Table 5.1 The IBM CMOS8S0 features
Table 6.1 Comparison between Static and LSD adders
Abstract
This thesis presents a 64-bit carry look-ahead adder (CLA) built with a level-sensitive
dynamic logic topology known at IBM as the Limited Switching Dynamic (LSD) circuit.
This circuit style is typically used in data paths that lie on highly critical paths,
because of its compactness and relatively low power consumption compared to other
dynamic logic styles. The advantages offered by the LSD circuit style are high density
coupled with a single-rail power supply design. The 64-bit carry look-ahead adder is a
compact design that uses various techniques and tools currently available at IBM to
design and implement circuits from schematics to layouts. Some aspects of SOI technology
are presented and briefly discussed. Compared to its static counterpart, the LSD adder is
more compact, switches at the same clock rate or higher, requires more tuning, and
consumes approximately 10 to 15 percent less power. A full description of the LSD circuit
operation is given, as well as a description of the architecture of the implemented LSD
carry look-ahead adder.
Chapter I
Introduction
In modern digital signal processing ICs and high-speed microprocessors, the
data path is the core where arithmetic and other computationally intensive operations
are carried out. A system's performance is largely determined by the design and
implementation of its architecture, its data paths, and finally its logic circuit styles.
Over the years, many types of dynamic logic design styles have been proposed and implemented.
1.1 Motivation
Computer arithmetic plays an important part in the design of a high-performance
microprocessor. Arithmetic operations are used very often, both to compute various
types of memory addresses and to perform addition, multiplication, and division, so
fast arithmetic circuits with low power consumption are very important. Arithmetic
circuits are architecturally complex and require a large number of transistors, and as a
result more and more transistors are packed into a smaller silicon area to accommodate
all the features. A high transistor count not only leads to high power consumption but
also reduces system reliability because of the high operating temperature, not to mention
the cost of cooling these systems. To remedy this problem, dynamic circuits are used more
and more where data-path-oriented designs become the bottleneck in terms of performance,
since dynamic circuits are compact and potentially have low power consumption, even
though the dynamic circuit topology has some well-known drawbacks such as charge leakage,
charge sharing, intolerance to clock skew, and non-inverting outputs.
The currently available dynamic circuit topologies offer significant advantages over
other logic circuit styles in terms of circuit density, relatively low power consumption,
and higher clock speed. However, the VLSI design community is still somewhat reluctant
to adopt dynamic or domino circuit styles over the well-known static CMOS circuit
topology because of reliability issues, charge leakage, power consumption, and
especially low noise margin.
The proposed LSD (Limited Switching Dynamic) circuit is an attempt to address
some of the problems encountered by the conventional dynamic and domino logic
circuit families. Static CMOS is well known for its reliability and noise tolerance
compared to other circuit topologies. As a case in point, IBM CMOS static circuits still
play an essential and dominant part in microprocessor design and implementation, but
dynamic logic circuits have slowly gained more acceptance and have been employed in
gigahertz processor design methodology. This is due in part to their high performance
and compactness: as clock frequency increases, the required switching speed becomes more
difficult to achieve in static CMOS circuits and results in high power consumption.
Several types of dynamic and domino circuit styles are currently used in microprocessor
design methodology; however, the LSD circuit topology is often chosen over other dynamic
logic styles because of the significant advantages offered by its latching and
single-rail capability.
1.2 Thesis overview
The thesis is organized as follows. Chapter 1 is an introduction. Chapter 2
briefly goes over background information, including a discussion of previous work on
domino and dynamic logic design styles with a single-phase clock. These logic circuit
styles are constructed either with or without latches. Chapter 2 also reviews some of the
basic dynamic and domino circuits and their functionality, with a discussion of some of
the shortfalls of these circuits. In addition, Chapter 2 describes the need for
high-speed circuit design styles for data-path-oriented design and explores various
techniques and design issues in Silicon on Insulator (SOI) technology.
Chapter 3 reviews the computer arithmetic of addition by summarizing the Boolean
derivations of the Carry Look Ahead (CLA) adder, which are used to implement the CLA
adders. Chapter 4 provides a detailed description of the Limited Switching Dynamic (LSD)
circuit in terms of circuit analysis and functionality, and discusses the pros and cons
of the LSD implementation. Chapter 5 discusses the sizing of static and dynamic
transistors, including the static design methodology currently used at IBM to perform
transistor sizing quickly. For sizing transistors in dynamic circuits, an iterative
method is used to determine the transistor sizes with IBM's methodologies and tools.
Both static and dynamic CLA adder implementations are also discussed in
Chapter 5, using IBM's standard library cells along with the custom design of dynamic
circuits. The designs are verified using Cadence tools, such as Design Rule Check
(DRC) and Layout Versus Schematic (LVS), along with other internally developed IBM
tools that further check the design for beta, clock lines, and other yield issues.
Chapter 5 also presents the designs of both the LSD and static CLA adders, using IBM's
CMOS8S0 SOI technology to implement the physical design of the adders. The features of
IBM's CMOS8S0 Partially-Depleted Silicon on Insulator (PDSOI) technology are briefly
discussed in Chapter 5.
Finally, the evaluations of both CLA adders, static and dynamic, are presented in
Chapter 6 in terms of chip area, performance, and power consumption. Chapter 7 is a
summary that includes some thoughts on future work and applications for LSD circuits
at IBM.
Chapter II
Background and Conventional Dynamic Circuit
This section briefly describes a few of the conventional dynamic and domino
logic design styles and includes a general discussion of circuits with these topologies.
Dynamic and domino circuits are slowly gaining acceptance because of their high speed
and lower power consumption, especially in the Silicon on Insulator (SOI) CMOS
process. In general, some authors concentrate on simplicity of design, low cost, and
reduced area with improved performance [1], [2], while other authors focus on higher
speed and more elaborate architectures. At the same time, there are also papers that
describe dynamic design techniques using pipeline latches in an attempt to reduce power
consumption with increasing clock speed. For example, high-speed CMOS design has
been discussed with different techniques for implementing dynamic circuits using single-
and multiple-phase clocking [1]. However, the subject of dynamic and domino logic is
broad, and many different topologies and styles are currently available; therefore this
thesis focuses on the conventional, basic configurations of dynamic and domino logic
styles.
2.1 Dynamic and Domino Circuits
Over the years, many papers have been published and many studies performed on dynamic
and domino circuits, with techniques ranging from simple to complex architecture
implementations. These papers focus on improving circuit performance while reducing
the problems associated with dynamic topologies, such as charge sharing, leakage, clock
skew sensitivity, and noise intolerance. Dynamic CMOS circuit structures are inherently
sensitive to noise and have lower noise margins compared to static CMOS structures.
Since there is a vast number of papers on dynamic circuits, this thesis confines the
discussion to circuit structures and to the subjects of low power and performance.
Dynamic logic is a circuit style well suited for high-performance microprocessor
design. It offers significant advantages over the competing static logic style: the
same or better performance, reduced chip area, and lower power consumption. However,
dynamic circuits are more susceptible to noise, charge leakage, crosstalk, and clock
skew. Limited Switching Dynamic (LSD) is similar to the dynamic logic circuit family,
except that LSD circuits have the ability to latch and store a value, which helps to
reduce power consumption.
In static logic, the output nodes are continually driven, either high or low
depending on the input logic, and the circuit spends very little time in transition
from low to high and vice versa. Thus, there is virtually no DC path between the supply
rails in either logic state. There is a small amount of current, called 'feed-through',
which is dynamic and lasts for a time that depends on the sizing of the circuits. Unlike
the case of a static circuit, however, the output node of a dynamic circuit is pulled
high, pulled low, or left floating with the logic value stored on the output node
capacitance. This value changes when the inputs to the logic circuit make a transition
from low to high while the clock is high, that is, in evaluation mode. A typical dynamic
circuit has a precharge phase and an evaluation phase: the output is precharged while
the clock is low, and the logic is evaluated through the pull-down transistors when the
clock is high and the evaluation or 'footed' device is ON.
Figure 2.1 is a typical example of a dynamic circuit, which implements an AOI
function. The circuit has two distinct phases of operation: precharge and evaluation. In
the precharge phase, the precharge transistor is ON while the clock is low, and the
output is forced to a high state for a period of time. As soon as the clock makes a
transition to high, the evaluation or 'footed' device is ON. Next, when the data arrives
on time, the values of the inputs determine whether or not a complete path is formed so
that current can flow and discharge the output to ground. If the inputs satisfy the
logic function (for example, if all inputs are high), a complete discharge path to
ground is formed, causing the output to go low and stay low. Otherwise the output
remains high. The inputs must remain stable during the evaluation phase or make only
monotonic transitions, because during evaluation the output can only make a transition
from high to low.
[Schematic labels: clk, P0, dynamic node, f(a,b,c) = ab + c, N0]
Figure 2.1 Dynamic circuit configuration
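The precharge and evaluation behavior described above can be sketched as a small behavioral model. This is only an illustration under simplifying assumptions: it ignores charge leakage, charge sharing, and timing, and the function names `dynamic_aoi` and `run_cycle` are hypothetical, not taken from the thesis.

```python
def dynamic_aoi(clk, a, b, c, prev_node=1):
    """Return the dynamic-node value of a footed dynamic gate with f = ab + c.

    Behavioral sketch only: no leakage, charge sharing, or delay is modeled.
    """
    if clk == 0:
        # Precharge phase: the precharge device pulls the dynamic node high.
        return 1
    # Evaluation phase: the foot device is ON. The pull-down network conducts
    # when f(a, b, c) = ab + c is true and discharges the node to ground.
    if (a and b) or c:
        return 0
    # No discharge path: the node floats and keeps its stored value.
    return prev_node


def run_cycle(a, b, c):
    """One full clock cycle: precharge (clk=0) followed by evaluation (clk=1)."""
    node = dynamic_aoi(0, a, b, c)
    return dynamic_aoi(1, a, b, c, prev_node=node)
```

Calling `run_cycle(1, 1, 0)` models a cycle in which the pull-down path conducts, so the node ends low, while `run_cycle(1, 0, 0)` leaves the node at its precharged value. Passing `prev_node=0` during evaluation shows that a discharged node cannot recover until the next precharge.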
The dynamic circuit has an inverted output and is extremely sensitive to clock and data
skew, charge leakage, and crosstalk. Thus, dynamic circuits cannot be cascaded
unless the clock has been properly delayed to synchronize with the arrival of the data
inputs; otherwise the circuit will be subject to output errors, and the data is
unrecoverable. To remedy this problem, as shown in Figure 2.2, an inverter is
added to the output of the dynamic circuit and, in addition, a feedback device known as a
"keeper" is included to help prevent charge sharing and leakage, while synchronizing the
clock and data at the input of the circuit. This is a commonly used technique in
dynamic/domino structures. The domino gate is composed of an NMOS pull-down
device and a PMOS pull-up device followed by an inverter. The addition of the inverter
permits domino circuits to be cascaded, where the input of each stage is driven by the
output of the preceding domino stage. The presence of the inverter guarantees that the
inputs of each stage are all de-asserted prior to the start of an evaluation phase. As
with dynamic circuits, the inputs of a domino circuit can only be slowly varying values
or non-inverting functions, and a domino circuit can only have non-inverting outputs.
[Schematic labels: clk, P0, f(a,b,c) = ab + c, N0]
Figure 2.2 Domino circuit configuration
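The cascading property of domino stages can be illustrated with a minimal behavioral sketch. It assumes an ideal keeper (the dynamic node simply stays high when no discharge path exists) and zero delay; `domino_stage` and `domino_chain` are hypothetical names, and the second stage (an AND with input `d`) is an arbitrary choice made for the example.

```python
def domino_stage(clk, pulldown_on, dyn_prev=1):
    """Return (dynamic_node, output) for one domino stage.

    pulldown_on: True when the NMOS pull-down network conducts.
    The output is the dynamic node passed through the static inverter.
    """
    if clk == 0:
        dyn = 1            # precharge: dynamic node high, output low
    elif pulldown_on:
        dyn = 0            # evaluate: node discharged, output rises
    else:
        dyn = dyn_prev     # ideal keeper holds the node at its stored value
    return dyn, 1 - dyn


def domino_chain(clk, a, b, c, d):
    """Two cascaded stages: stage 1 computes ab + c, stage 2 ANDs it with d."""
    _, out1 = domino_stage(clk, bool((a and b) or c))
    _, out2 = domino_stage(clk, bool(out1 and d))
    return out1, out2
```

During precharge (`clk=0`) every stage output is 0, so the inputs of the following stage are de-asserted before evaluation starts. During evaluation the outputs can only rise, which is what allows one stage to drive the next.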
There are many types of dynamic and domino logic in different logic families, but
in this thesis we limit our discussion to single-rail dynamic/domino circuits with a
half-latch that includes a keeper, since the LSD logic style closely resembles these
logic families.
The circuits mentioned above are susceptible to noise and charge leakage. The
conventional dynamic/domino circuits continuously switch their outputs between the
precharge and evaluation modes when the input logic is high; however, when the input
logic is low the output is latched, since the dynamic node is floating. Consequently, the
dynamic node is held high by the node capacitance during the evaluation and precharge
phases. In this condition, the output stays low until the input makes a transition from
low to high. This is because the feedback node is tied directly to the dynamic node and
the 'keeper' pulls this node up when the output goes low. In addition, there is no
isolation between the dynamic node and the output inverter, which causes the output to
continue to toggle and results in high power consumption. The power consumption depends
strongly on the input values, the operating frequency, the switching activity, and the
node and load capacitances of the dynamic circuit.
In addition to the issue of high power consumption, there are concerns with single
rail power supply dynamic logic because complementary outputs are not available;
consequently, a dynamic circuit is hard to use in conjunction with static or other logic
circuit styles. To maintain compatibility with other logic circuit families, inverted
outputs must also be available so that dynamic/domino interface circuitry is not
required. To generate inverted outputs for dynamic or domino logic, a dual rail power
supply methodology is needed, which requires extra complementary logic. As a result, the
dual-rail methodology not only requires more time, effort, and extra circuitry but also
makes the design more complex and increases power consumption. A high device count not
only takes more area but also requires more wiring tracks and makes the cell more
irregular, since the availability of metal 1 wiring tracks usually determines how large
a cell becomes. Typically, the output of a dynamic circuit directly drives the input of
another static or dynamic gate. Without the inverter, the noise margin at the input of
the driven dynamic gate is reduced, and when noise occurs it is difficult for the circuit
to recover from the error. The cycle can be stretched to give the circuit more time to
recover; however, there are cases in which the circuit cannot recover from the erroneous
state.
Dynamic circuits are also sensitive to clock delay and suffer from reduced noise
margins, because the output of a dynamic circuit without the inverter is not completely
latched. In a domino circuit, the half-latch latches the output when the output is low.
When the clock delay differential is large, however, the output will switch prematurely,
leading to an erroneous state from which the circuit cannot recover. The presence of the
inverter could also help the noise margin, but very often its beta ratio is sized to
favor the rising or the falling edge, which reduces the overall noise margin of the
circuit since there is no isolation between the output inverter and the dynamic node.
2.2 Bulk CMOS vs. SOI technology: Partially and Fully Depleted SOI
For the past few decades, bulk CMOS technology has advanced along an exponential
path of shrinking device dimensions, increasing density, increasing speed, and
decreasing cost. This rapid scaling may soon come to an end because of power dissipation
constraints, leakage current, and various other issues. The primary problem is static
power dissipation, which is caused by leakage current from channel tunneling and other
mechanisms. In addition, the level of integration on a VLSI chip keeps growing, with
more complex functions on a single chip, which leads to an extremely high transistor
count. The motivation for moving to SOI is clearly to extend the scalability of CMOS and
to continue to enjoy the performance and density benefits of migrating designs and
architectures to the next lithography generation.
Silicon-on-insulator (SOI) offers a 15% to 35% performance gain over bulk
CMOS. High-performance microprocessors using CMOS SOI have been designed and are
available in production. As the technology moves to the 0.13 µm generation, SOI
CMOS is being used in more and more applications and is spreading to lower-end
microprocessors and memory products. As we move to the 0.1 µm generation and
beyond, SOI CMOS is expected to be the technology of choice for system-on-a-chip
(SOC) applications, which require high-performance CMOS devices, high density, and
lower power consumption.
Currently, there are two ways to implement an SOI device: fully depleted (FD) and
partially depleted (PD). Both are designed to reduce the body and other junction
capacitances, but the partially depleted process has gained more acceptance because it
is more compatible with current bulk CMOS, requiring little change in the process, and
is therefore more practical and less costly. The fully depleted (FD) process achieves a
more significant reduction in source/drain junction and body capacitances than PD;
however, to achieve this gain a very thin silicon film is required, which makes it more
expensive and highly incompatible with the current bulk CMOS process without major
retooling. The financial investment is too great for most companies to be willing to
take on, and as a result the PD process currently has more acceptance in the mainstream,
with some performance tradeoffs.
PD-SOI can be thought of as an evolutionary rather than a revolutionary device
structure: with the exception of SIMOX (Separation by Implantation of Oxygen) wafer
formation, the CMOS wafer fabrication is performed on the same bulk CMOS tool set,
targeting parameters that are very similar or identical to those of bulk CMOS.
Scalability is also very important: a device definition should continue to produce a
viable product when scaled through successive lithography generations. Because PD-SOI
does not require the ultra-thin active silicon thickness typically used in fully
depleted SOI (FD), investments in its development can be returned over more generations.
Because PD-SOI uses the same materials, tools, processes, and parameter targets as its
bulk predecessor, the spatial variability in SOI electrical parameters is well
established and accommodated in commonly used software tools. That leaves the
use-dependent variation, which can be shown to be well modeled. With FD-SOI, the entire
body of the transistor is depleted. The threshold voltage is therefore a function of the
charge contained in the body and can vary substantially across the chip.
It is becoming more evident that continued CMOS scaling into the deep submicron
region is causing the limits of bulk-substrate CMOS to emerge. With diminishing returns
in performance and higher leakage currents in store for the coming lithography
generations, partially depleted SOI becomes a more and more attractive alternative
technology for high-speed, low-power design. Advances in the processing and
fabrication of SIMOX wafers, as well as a better understanding of SOI circuit behavior,
make this technology a candidate for volume application in the present generation. SOI
is also known for design liabilities which, because the simpler bulk CMOS option
provided acceptable power-delay, discouraged its use in the past. With the substantial
advantages SOI is now noted for, it has become worthwhile to examine its electrical
behavior more closely so that robust, complex SOI products may be initiated.
2.3 Source of power consumption in static and dynamic circuits
In the static circuit topology, the most significant part of the power consumption
occurs only during transitions, when the output is charging or discharging. The output
capacitance consists of three components: $C_{int}$, $C_{wire}$, and $C_{load}$.
$C_{int}$ represents the internal capacitance of the gate and consists mostly of the
diffusion capacitance of the drain. $C_{wire}$ represents the capacitance of the
physical interconnections, and $C_{load}$ represents the sum of the input gate
capacitances of the transistors fed by the output. $C_{int}$ and $C_{load}$ depend
mostly on the specific technology generation and can be obtained and pre-calculated from
the technology library. $C_{wire}$, however, is highly unpredictable because it is
routing dependent: it usually depends on the length of the wires, the metal layer used,
and how optimally the connection is made.
There are three major sources of power dissipation in static circuits, as
described in equation (2.1).
(2.1)
The first term represents the switching component of the power, where $C_L$ is the load
capacitance, $f_c$ is the operating frequency of the system, and $\alpha$ is the
probability that a power-consuming transition occurs, also known as the activity factor.
The second term of equation (2.1) is due to the short-circuit current that flows when
both the NFET and PFET are active and conducting for a short period during a high-to-low
or low-to-high transition. Finally, the leakage current $I_L$, which can arise from
substrate injection and sub-threshold effects, is primarily determined by the fabrication
technology under consideration. In equation (2.1) the dominant terms are
$\alpha \cdot C_L \cdot f_c$ and $V_{DD}$. The short-circuit power consumption is
$$P_{sc} = \frac{\beta}{12}\,(V_{DD} - V_{Tn} - V_{Tp})^{3}\,\frac{\tau}{T} \tag{2.2}$$
and the static power due to leakage current is
$$P_s = I_L V_{DD} \tag{2.3}$$
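The three contributions described for equation (2.1) are commonly combined in the form below. This is a standard textbook expression, offered here only as an assumption about the shape of equation (2.1); the symbol $I_{sc}$, an average short-circuit current, is introduced for illustration and does not appear in the surrounding text.

```latex
P_{total} \;=\; \underbrace{\alpha\, C_L\, V_{DD}^{2}\, f_c}_{\text{switching}}
          \;+\; \underbrace{I_{sc}\, V_{DD}}_{\text{short circuit}}
          \;+\; \underbrace{I_L\, V_{DD}}_{\text{leakage}}
```

Read this way, the second and third terms correspond to $P_{sc}$ in equation (2.2) and $P_s$ in equation (2.3), respectively.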
Dynamic power consumption is somewhat different from that of static CMOS, since a
dynamic gate is clocked and operates in two phases: precharge and evaluation. Because of
the precharging and discharging operations, dynamic circuits consume power in both
phases. Since a dynamic circuit can dissipate energy even when the input value stays
constant, the power consumption of a dynamic logic circuit can be summarized as
(2.4)
Similar to the static power consumption, $P_{dyn}$ also depends on the operating
frequency $f_c$, the capacitive load $C_L$, and $V_{DD}^{2}$, except that the power
consumption is half that of the static circuit. Finally, the switching probability
$\alpha$ is defined as
$$\alpha_{0 \to 1} = \lim_{N \to \infty} \frac{n(N)}{N} \tag{2.5}$$
where $N$ is the number of clock cycles and $n(N)$ is the number of $0 \to 1$
transitions in $N$ clock cycles. The formulas above and the coefficients below are the
information that IBM's Power Calculator uses to estimate the power consumption of a
design.
Data types                            Activity ($\alpha_{activity}$)
Clock                                 0.5
Random data signal                    0.5
Simple logic driven by random data    0.4 - 0.5
Finite state machines                 0.08 - 0.18
Conclusion                            0.05 - 0.5
Table 2.1: IBM's activity coefficients
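As an illustration of the activity factor in equation (2.5), the sketch below estimates it from a finite sampled trace, one sample per clock cycle, and then evaluates a switching-power term of the form $\alpha C_L V_{DD}^2 f_c$. The finite-trace estimate, the function names, and the numerical values are all assumptions made for the example; this is not IBM's Power Calculator.

```python
def activity_factor(trace):
    """Finite-trace estimate of the 0 -> 1 activity factor n(N) / N.

    trace: sequence of 0/1 samples, one per clock cycle boundary, so a
    trace of length N + 1 covers N cycles.
    """
    if len(trace) < 2:
        raise ValueError("need at least two samples")
    rises = sum(1 for prev, cur in zip(trace, trace[1:]) if prev == 0 and cur == 1)
    return rises / (len(trace) - 1)


def switching_power(alpha, c_load, vdd, f_clk):
    """Switching power alpha * C_L * V_DD^2 * f_c in watts (assumed form)."""
    return alpha * c_load * vdd ** 2 * f_clk


# Hypothetical example: a signal that toggles every cycle, like a clock.
alpha = activity_factor([0, 1, 0, 1, 0, 1, 0, 1, 0])
power = switching_power(alpha, c_load=10e-15, vdd=1.2, f_clk=1e9)
```

A signal that toggles on every sample rises in half of the cycles, which is consistent with the clock entry in Table 2.1. With the illustrative values above (10 fF, 1.2 V, 1 GHz) the switching term is on the order of microwatts.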
Chapter III
Computer Arithmetic Review
This section provides some basic background on arithmetic addition and explains
the theory behind the Carry Look Ahead addition algorithm and its derivation. Two
implementations of the CLA, at different levels of abstraction, are discussed, and the
two CLA architectures are compared.
3.1 CLA addition theory
Since the most critical path of an adder is its carry chain, many carry schemes have
been proposed over the years. The Carry Look Ahead (CLA) algorithm is one of the fastest
carry generators for an adder. The CLA computes the carry bits from the inputs in
parallel, since in binary addition it is possible to pre-compute the carries in
parallel. In binary addition, the sum bit of two binary numbers is described as
$$s_i = \bar{x}_i \bar{y}_i c_i + \bar{x}_i y_i \bar{c}_i + x_i \bar{y}_i \bar{c}_i + x_i y_i c_i \;\Rightarrow\; s_i = (x_i \oplus y_i) \oplus c_i \tag{3.1}$$
and the carry,
$$c_{i+1} = x_i y_i + x_i c_i + y_i c_i \tag{3.2}$$
However, if the carry equation is refined further by expanding equation 3.2, a pattern
can be seen, and we can define another set of equations to represent the carry generate
and propagate signals,
$$g_i = a_i b_i \quad \text{and} \quad p_i = a_i + b_i \quad \text{or} \quad p_i = a_i \oplus b_i \tag{3.3}$$
where $g_i = a_i b_i$ represents the generate signal, which generates a carry at
position $i$, and $p_i = a_i + b_i$ propagates an incoming carry through position $i$.
These are the basic principles of the CLA algorithm. Using the new notation for the
generate $g_i$ and propagate $p_i$, a carry enters position $i + 1$ if a carry is
generated in position $i$, or if a carry enters position $i$ and is propagated. The
carry chain is therefore found to be
$$c_{i+1} = g_i + p_i c_i$$
Solving this equation recursively,
if we further extend the carry to one more position for a 4-bit adder,
(3.4)
(3.5)
(3.6)
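The recursive expansion of $c_{i+1} = g_i + p_i c_i$ for a 4-bit block can be sketched as follows. The unrolled expressions are the standard look-ahead forms obtained by substituting the recurrence into itself; how they map onto the numbering of equations 3.4 to 3.6 is not assumed here, and the function names are hypothetical.

```python
def cla4_carries(a_bits, b_bits, c0):
    """Carries c1..c4 of a 4-bit block, each computed directly from g, p and c0.

    a_bits and b_bits are lists of four 0/1 values, least significant bit first.
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate:  g_i = a_i b_i
    p = [a | b for a, b in zip(a_bits, b_bits)]   # propagate: p_i = a_i + b_i
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(a_bits, b_bits, c0):
    """Reference model: apply c_{i+1} = g_i + p_i c_i one bit at a time."""
    carries, c = [], c0
    for a, b in zip(a_bits, b_bits):
        c = (a & b) | ((a | b) & c)
        carries.append(c)
    return carries
```

Each successive carry adds one product term and one more literal to its longest term, which is the fan-in growth that limits the direct form to about four bits in practice.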
In general, the above equations summarize the CLA carry pre-computation algorithm. The
equation for the carry into position $i + r$ is given as,
(3.7)
Implementing the above equation requires $i + 1$ gates with a maximum fan-in of
$i + 1$, and producing all the carries from $c_{i+1}$ to $c_{i+r}$ requires a total gate
count of
$$\sum_{i=1}^{n} (i + 1) = \frac{(n + 3)\,n}{2} \tag{3.8}$$
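The closed form of the gate count in equation (3.8) follows from splitting the sum into two elementary sums. The short derivation below also checks it for a 4-bit block; the choice of $n = 4$ is only an example.

```latex
\sum_{i=1}^{n} (i+1) \;=\; \sum_{i=1}^{n} i \;+\; \sum_{i=1}^{n} 1
                     \;=\; \frac{n(n+1)}{2} + n
                     \;=\; \frac{n(n+3)}{2},
\qquad
n = 4:\quad 2 + 3 + 4 + 5 = 14 = \frac{4 \cdot 7}{2}.
```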
Equations 3.4 to 3.6 represent the first level of abstraction of the adder, because
the 'generate' and 'propagate' signals are obtained individually from equation 3.3.
Theoretically, the CLA algorithm can extend the carry computation to any number of bits
N; however, because of fan-in limitations, practical hardware implementations of the
CLA algorithm are usually limited to a fan-in of 4 and built from 4-bit block modules.
The first level of abstraction is only useful for an adder with N less than 16 bits,
because as the operand gets larger it becomes less efficient and incurs more delay. To
implement a CLA with a larger operand, greater than 32 bits, a further expansion of the
CLA is required. If we consider a second level of abstraction, we can define the 4-bit
block CLA 'propagate' and 'generate' super signals as,
(3.9)
(3.10)
where $P_{3:0}$ and $g_{3:0}$ are the block 'propagate' and 'generate' signals:
$P_{3:0}$ is true when the block propagates an incoming carry, if one is present, and
$g_{3:0}$ is true when the block itself generates a carry. The super $P_{3:0}$ and
$g_{3:0}$ are similar to the signals of the first CLA abstraction; however, $P_{3:0}$ is
the product of the individual 'propagate' bits, and $g_{3:0}$ is the sum of products of
the individual 'propagate' and 'generate' signals. Because expressions 3.9 and 3.10 are
not embedded in the carry pre-computation of equations 3.4 to 3.6, the next carry of a
4-bit CLA is simply reduced to,
(3.11)
As seen in equation 3.11, computing the next carry requires only a 3-input AOI (And Or
Invert) gate. This is one of the reasons that larger operands can be handled more
efficiently and with a smaller delay. Equivalently, the next carry is often expressed as,
$$C_{i+1} = G_i + P_i C_i \tag{3.12}$$
where $P_i$ and $G_i$ are the super 'propagate' and 'generate' signals, respectively.
Finally, the sum is expressed in terms of the previous carry and the propagate signal. In
this case, it is an individual propagate signal, which is computed from the input bits
$a$ and $b$,
$$s_i = \pi_i \oplus c_i, \quad \text{where } \pi_i = a_i \oplus b_i \tag{3.13}$$
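The second level of abstraction can be sketched functionally for a 16-bit adder built from four 4-bit blocks. The block propagate and generate expressions used below are the standard ones matching the verbal description (product of the propagate bits, and sum of products of propagate and generate bits); they are assumed forms, not recovered from equations 3.9 and 3.10, and `block_pg` and `cla16_add` are hypothetical names. For brevity the bit carries inside each block are rippled in software, whereas hardware would compute them in parallel.

```python
def block_pg(a_bits, b_bits):
    """Block propagate P and generate G for one 4-bit group (LSB first)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a | b for a, b in zip(a_bits, b_bits)]
    P = p[3] & p[2] & p[1] & p[0]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return P, G


def cla16_add(x, y, c_in=0):
    """Add two 16-bit integers; returns (16-bit sum, carry out)."""
    total, carry = 0, c_in
    for blk in range(4):
        a_bits = [(x >> (4 * blk + i)) & 1 for i in range(4)]
        b_bits = [(y >> (4 * blk + i)) & 1 for i in range(4)]
        P, G = block_pg(a_bits, b_bits)
        c = carry                                   # carry into bit 0 of this block
        for i in range(4):
            pi = a_bits[i] ^ b_bits[i]              # sum propagate: a XOR b
            total |= (pi ^ c) << (4 * blk + i)      # sum bit: (a XOR b) XOR carry
            c = (a_bits[i] & b_bits[i]) | ((a_bits[i] | b_bits[i]) & c)
        carry = G | (P & carry)                     # block carry: G + P * carry-in
    return total, carry
```

The carry into each block depends only on that block's `P` and `G` and on the carry into the previous block, so the block-level carry logic stays at three inputs regardless of operand width.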
3.2 CLA Architectural description
Theoretically, the CLA algorithm is one of the fastest ways to pre-calculate and
predict the carry by using the inputs; however, there is a hidden penalty in input fan-out
and area as the number of bits begins to increase. When the number of input bits is
greater than 4, equation 3.6 cannot be easily extended because of fan-in and fan-out
limitations. As we observed in equations 3.4 to 3.6, the number of literals of each product
term increases by one with each successive equation, which in turn requires a gate with a
fan-in of N+1. In practice, technology limits the fan-in of a gate: the number of inputs
available for a typical AND/OR gate is usually limited to 3 or 4 due to the stacking of
NFET devices for an AND gate and PFET devices for an OR gate implementation. For
the reason described above, the CLA adder grows exponentially when N is greater than 4
and, as a result, the CLA algorithm is optimal at 4 bits due to technology fan-in
limitations.
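The fan-in growth can be made concrete with a small sketch that writes out the product terms of the fully expanded carry. It assumes the usual flat look-ahead expansion (each generate bit ANDed with all higher propagate bits, plus the carry-in ANDed with every propagate bit); the helper names are hypothetical.

```python
def carry_terms(n):
    """Product terms of the carry out of n bits in flat look-ahead form.

    Each term is a list of literal names; the carry is the OR of all terms.
    """
    terms = []
    for k in range(n - 1, -1, -1):
        # g_k must be propagated through bits k+1 .. n-1
        terms.append([f"p{j}" for j in range(n - 1, k, -1)] + [f"g{k}"])
    # the incoming carry must be propagated through every bit
    terms.append([f"p{j}" for j in range(n - 1, -1, -1)] + ["c0"])
    return terms

def max_fan_in(n):
    """Largest gate fan-in needed: the widest AND term or the final OR."""
    terms = carry_terms(n)
    return max(len(terms), max(len(term) for term in terms))

if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, max_fan_in(n))
```

Under these assumptions a 4-bit group already needs a 5-input gate, and the requirement keeps growing as N+1, which is why the text restricts the flat form to 4-bit blocks.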
The Carry Look Ahead Adder is selected for this implementation due to its
parallel carry computation and prediction and its relatively more regular structure as
compared to other implementations, such as the conditional sum adder, carry-save adder,
carry-skip adder, and various other architectures available for implementing the addition
algorithms. Also, it is important to keep the adder scalable so that it can easily migrate to
a new technology. This can be done with relative ease with the CLA architecture. In a
straightforward implementation, directly from the above equations and closed-form
expressions, we can derive the number of gates needed and predict the delay and the most
critical path of the adder. The adder is more regular and can be implemented in a
modular 4-bit architecture and configured, either in parallel or serially, with several
high-level abstractions depending on the size of the operand. However, the optimized size
is a fan-in of 4 with three levels of abstraction and very little additional hardware required.
CLA: Second level of Abstraction
The 4-bit CLA adder can be implemented with a carry block in place of the 4-bit
generator, separating out the carry function that was previously embedded in the carry
look-ahead block. This CLA configuration reduces the critical carry path to a smaller
number of gate delays, and the structure is much more regular due to the additional 4-bit
carry block that is used to generate the carry function. Equation (3.12), in chapter 3,
describes the dedicated carry block that computes the carry externally and feeds it to the
next stage of the CLA.
(3.12)
The above equation can be easily implemented with a simple OR function, with the carry
computed simultaneously outside the carry generate block as shown in figure 3.4. This
external carry computation block is implemented directly from the equation, and the block
implements the propagate and generate signals using the equations below,
(3.10)
(3.9) and
where p is the propagate signal, which passes a carry on when one is present, and g is the
generate signal, which produces a carry. The carry of a 4-bit CLA can be expressed as,
(3.11)
(3.12)
and the sum is,
(3.13)
Figure 3.1 shows a 4-bit CLA. The top part shows a block diagram of the 4-bit CLA and
the bottom shows a detailed diagram of the adder.
Figure 3.1: 4-bit carry block
[Block diagram: the 4-bit CLA module takes a<0:3>, b<0:3> and cin and produces the sum, P*4, G*4 and Cout. The detail shows a 4-bit prop, gen & sum block, a 4-bit prop & gen block producing p4 and g4, and a 4-bit carry gen block that drives Cout to the next block.]
The 4-bit adder consists of a 4-bit reduced adder, which includes the propagate and
generate functions, and an external dedicated carry function. The 4-bit propagate,
generate and sum block is composed of equations (3.9) and (3.10) above; these are used
to compute in parallel the propagate and generate outputs from the inputs x and y.
The sum is computed by the reduced half adder. The significance of Figure 3.1 is the
carry block, since this block is a very simple function to implement and it only needs
p4 and g4 in order to compute the output carry for the next stage. This reduces the
complexity and speeds up the carry output significantly.
Figure 3.2: 16-bit CLA
[Block diagram: the 16-bit CLA module takes a<0:15>, b<0:15> and cin and produces sum<0:15>, P*16 and G*16. The detail shows four 4-bit CLA blocks operating on bits <0:3>, <4:7>, <8:11> and <12:15>, whose p*4 and g*4 outputs (p0/g0 to p3/g3) feed a 4-bit prop & gen block.]
Figure 3.2 shows the block diagram of the 16-bit CLA and a detailed diagram of
the CLA. This 16-bit adder consists of four 4-bit CLA modules as previously described in
Figure 3.1. Each of the 4-bit adders takes in two operands x<0:3> and y<0:3> and
produces a sum<0:3> and two bits, propagate and generate, p4 and g4. The propagate and
generate outputs from each of the 4-bit adders are fed to the 4-bit propagate and generate
block in order to produce the 2-bit super outputs P and G. These P and G bits will be fed
to the next level of the CLA in order to make a 64-bit CLA. The output carry of each 4-bit
adder stage is serially fed to the next 4-bit adder stage to complete the 16-bit CLA module.
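The composition described above can be sketched behaviourally. This is a minimal model, not the thesis netlist: it assumes OR-type bit propagate, a dedicated carry block computing carry = G + P*carry_in per 4-bit module, and a second-level block that combines the four (P, G) pairs in the same way. All function names are illustrative.

```python
def to_bits(value, width):
    """Integer to list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """List of bits (least significant first) back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))

def group_pg(p, g):
    """Combine four (propagate, generate) pairs into one (P, G) pair."""
    P = p[0] & p[1] & p[2] & p[3]
    G = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
         | (p[3] & p[2] & p[1] & g[0]))
    return P, G

def cla4(a, b, cin):
    """4-bit module: returns the sum bits and the block P and G."""
    g = [x & y for x, y in zip(a, b)]
    p = [x | y for x, y in zip(a, b)]
    c = [cin]
    for i in range(3):            # internal carries (written serially here)
        c.append(g[i] | (p[i] & c[i]))
    s = [a[i] ^ b[i] ^ c[i] for i in range(4)]
    P, G = group_pg(p, g)
    return s, P, G

def cla16(a, b, cin):
    """Four 4-bit modules; each carry block feeds the next module."""
    s, ps, gs = [], [], []
    carry = cin
    for blk in range(4):
        lo = 4 * blk
        s4, P, G = cla4(a[lo:lo + 4], b[lo:lo + 4], carry)
        s.extend(s4)
        ps.append(P)
        gs.append(G)
        carry = G | (P & carry)   # dedicated carry block
    P16, G16 = group_pg(ps, gs)   # super signals for the next level
    return s, P16, G16, carry
```

The same `group_pg` step could be applied once more to four 16-bit modules to model a 64-bit adder; the internal carries of `cla4` are written as a loop only for brevity.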
Figure 3.3: 64-bit CLA
[Block diagram: four 16-bit CLA blocks operating on bits <0:15>, <16:31>, <32:47> and <48:63>. Each block's propagate and generate outputs feed a 4-bit carry gen block whose cout drives the cin of the next block; the last carry block produces Cout64.]
Finally, the 64-bit adder is composed of four 16-bit modules as shown in Figure 3.3.
The pairs (p16, g16), (p32, g32), (p48, g48) and (p64, g64) are generated by the 16-bit
modules. These four sets of propagate and generate signals serve as the inputs for the carry
blocks, which compute the carry output and serially feed it to the next stage of the CLA.
The 64-bit CLA is configured similarly to a ripple carry adder. It consists of 4 carry
function blocks that simultaneously compute the carry outputs and feed back the carry in
order to generate the sum outputs. With a dedicated carry adder in the
architecture, it takes 476 gates with 2 levels of abstraction and has 13 gate delays.
Chapter IV
Limited Switching Dynamic Style (LSD)
This chapter presents a detailed description of the LSD (Limited Switching
Dynamic) logic style in terms of its functionality and circuit analysis. The 1st part
describes the LSD circuit with a detailed analysis, its advantages and disadvantages over
the conventional circuit style, and its application. Delay has always been an important
parameter in circuit design and is therefore discussed in this chapter, along with
transistor sizing in Static and LSD CMOS circuits.
4.1 LSD descriptions and analysis
[Timing diagram, LSD clock pulse generation: the system clock (Sys clk) and the derived LSD ph1 and LSD ph2 pulses, with static stages in between.]
Figure 4.1: LSD clock generation scheme
A two-phase clock design is used for the LSD circuit and the CLA (Carry Look Ahead
Adder); however, the LSD logic style can also be used to speed up the data path by
interleaving LSD stages between static stages, with no logic level translation needed. As a
result, the LSD circuit can be used twice in a clock cycle, provided the precharge and
evaluate cycles are short relative to the system clock, which is typically generated using
LCBs (Local Clock Buffers), as shown in Figure 4.1 above. This interleaving technique is
often used at IBM to shorten and speed up the critical paths at the cost of time and effort,
since LSD circuits require more elaborate design iteration and simulation, with more
corners, than typical static CMOS logic styles. Also, interleaving LSD circuits between
static stages requires more hardware, such as LCBs and clock pulse generation, in order to
generate pulse widths that are short relative to the system clock, namely the ph1 and ph2
clock pulses. In chapter 2, a brief discussion was given of the advantages and
disadvantages of the conventional dynamic and domino circuits: because they are clock
sensitive and unable to latch in a value, the output constantly changes as the dynamic node
toggles, regardless of whether the input swings from high to low or stays constant for a
long period of time. This results in high power consumption in the precharge and
evaluation modes.
[Schematic: LSD AOI gate with clock clk, devices P0, P1, N0, N1, N2 and N3, an output inverter, and the function f(a,b,c) labelled ab+c.]
Figure 4.2: LSD AOI circuit configuration
Figure 4.2 shows a typical LSD AOI circuit topology, which consists of Nfet
devices that implement the logic inputs of the AOI function; a Pfet and footed devices are
used for precharge and evaluation. In addition, the LSD circuit includes a latching
mechanism that consists of an inverter and a half latch, similar to conventional domino
circuits; however, the LSD circuit has 4 extra devices, N1, N2, N3 and P1, inserted
between the output inverter and the input logic tree. These 4 devices act as a tri-state
inverter, which is used either to latch in the value or to isolate the input and output stages,
depending on the feedback from the output while the clock is high, that is, in the
evaluation phase. This so-called tri-state inverter provides isolation between the input
dynamic node and the output stage; together with the output half-latch it acts like a
transparent latch that is driven by the feedback value of the output. When the output makes
a transition from low to high, the feedback Nfet device is turned on and the tri-state
inverter operates like a normal inverter, thus latching the output value regardless of the
precharge cycle. This output value will not change until the next set of logic inputs
evaluates to high; the output will then be pulled low and the keeper Pfet device will be
activated to latch in this value.
The output value is stored on the latch node, whose capacitance consists of the
drain junction capacitance and the gate input capacitance and therefore depends on the
size of the tri-state inverter. Because the value is latched at this node, the output does not
switch as often as in the conventional dynamic or domino logic families when the input
stays high or low for a long period of time, that is, for more than one cycle. The output
only makes a transition when the clock is high. When the output value is high, the LSD
circuit behaves like a typical dynamic circuit, that is, the output always makes a
monotonic transition: the tri-state inverter becomes active, the precharge value
propagates through, and the feedback "keeper", Pfet device P2, is turned off when the
clock is high. In evaluation mode, the precharge Pfet is turned off whether the inputs
evaluate to high or low. The two footed devices connected to the tri-state inverter act as
one. If the inputs evaluate to low, the output state does not change. If the inputs evaluate to
high, the dynamic node is discharged through the footed device, the output node makes a
transition from high to low, and this in turn turns on the "keeper". Together, this half-latch
latches in the value regardless of the precharge value and the precharge cycle, until there
is a change at the input logic and the inputs once again evaluate to a logic high so that the
output node can switch state. When the output stays low, it remains low in the subsequent
precharge cycle because the Pfet keeper is turned on and the tri-state inverter goes to a
high impedance state; as a result the output stays until the logic inputs evaluate to a high
value, thus forcing the output to make a transition from high to low.
[Schematic: LSD inverter with clock clk, input a, devices P0, N2 and N3, and output z.]
Figure 4.3: LSD inverter configuration
Figure 4.3 shows an LSD inverter. Unlike the conventional dynamic or domino
circuit, the output is completely latched by the tri-state inverter and the half-latch
structure regardless of the input. The conventional dynamic circuit only latches the output
when the input stays low or during a precharge cycle: the dynamic node is left floating and
is pulled high by the precharge Pfet P0, causing the output to go high, and the output is
then latched. However, if the input stays high for more than one cycle, the dynamic node
toggles as the precharge and evaluate cycles proceed; as a result the output toggles
continuously, following the dynamic node so long as the input stays high or evaluates to a
high value. On top of the latching capability, the LSD topology also provides better noise
immunity, because the tri-state inverter and the output inverter effectively isolate the static
or latched node; it thus takes much more effort to cause a false transition at the latched or
static node. The tri-state inverter and the output inverter can also be sized to provide a
much better noise margin than that of the conventional dynamic or domino logic family,
as can be seen in Figure 4.4.
[Schematic: LSD inverter with clock clk, input A, the 'latch' node, devices N2 and N3, the output inverter and output Z.]
Figure 4.4: LSD inverter with clk=0, regardless of input A, Z=1
The output is totally latched unless the inputs are making a transition. When the
input stays either high or low for a long period of time, the output stays latched. This
helps the power consumption, and when a transition is made, the static node does not
make a full-swing transition. LSD logic styles also have the ability to locally invert the
output, at the expense of delay, area and some skew between the complementary outputs,
without resorting to dual rail design: only a buffer is added, unlike other dynamic/domino
styles, which need a full complementary logic tree and an additional inverter. The output
inverters of the complementary LSD outputs need to be sized carefully in order to
minimize the impact; these inverters can also be sized to favor the rising or falling edges
accordingly.
[Schematic: LSD inverter with clock clk, devices N0, N2 and N3, the output inverter and output z.]
Figure 4.5: LSD inverter with clk=1, Z=1
Figures 4.4, 4.5 and 4.6 demonstrate how the LSD inverter behaves with different
values of input A. When the clk is low, the LSD inverter is in precharge mode; the output
Z always stays high because Nfet N0 is turned off and all other devices are active. Since
the output Z goes high, it turns on the Nfets N2 and N3, which complete a path to
activate the tri-state inverter as seen in Figure 4.4. The circuit will act like a half-latch as
long as the output value remains high. In evaluation mode, when the clk is high, the N0
and N1 Nfet devices become active. If the input value A is high or evaluates to high, the
LSD inverter makes a transition from Figure 4.5 to Figure 4.6 and the feedback Pfet
device, the keeper, is turned on because the output Z makes a transition from high to
low; this in turn latches in the output value. The tri-state inverter stays active since
Nfet device N1 provides a complete path to ground for the tri-state inverter while the
circuit is in evaluate mode, that is, while clk is high.
[Schematic: LSD inverter with clock clk, input a, devices P2, N1 and N3, and output z.]
Figure 4.6: LSD inverter with clk=1, A=1 and Z=0
Figure 4.7 shows a test schematic that consists of three different types of circuits:
dynamic, domino and LSD circuits.
Figure 4.7: Dyn/Dom and LSD test circuits
As can be seen in this schematic, a simple inverter is given for each of the Dynamic,
Domino and LSD circuits, and an ASX simulation (ASX is a transistor-level simulator
similar to SPICE) is performed on these circuits with an input test sequence for the input
A. The Dynamic and Domino inverters have a single output, whereas the LSD circuit can
locally generate the non-inverted output without affecting the internal loading of the
circuit, since the internal node is completely latched and isolated by the tri-state inverter.
Figure 4.8 shows the output waveforms of the test circuits presented in Figure 4.7. The
output of the LSD circuit is completely latched despite the input values being toggled
during the clock cycle. This output only makes a transition when the input value A stays
low for at least one cycle.
[Waveform plot with annotations: Dyn/Dom outputs toggle; latched LSD's output; LSD's dynamic node.]
Figure 4.8: LSD's inverter waveform outputs
Unlike the LSD inverter, the outputs of the Dynamic and Domino circuits are not latched
and keep toggling so long as the input value continues to stay high and the clock continues
to switch. The internal node of the LSD circuit toggles just like the
Dynamic and Domino outputs; however, the stat _ jbk _fat node is a static node and
latches the value as soon as the LSD's output makes a transition from high to low.
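The difference in switching activity can be illustrated with a purely behavioural sketch. It is an abstraction under stated assumptions, not a model of the devices in Figures 4.3 to 4.6: the dynamic node is assumed to precharge high every cycle and to discharge during evaluation when A is high, while the LSD output Z is assumed to take the complement of A during evaluation and to hold that value through the next precharge. The function names and the test sequence are hypothetical.

```python
def dynamic_node_wave(a_per_cycle):
    """Samples (precharge, evaluate) of a conventional dynamic node."""
    wave = []
    for a in a_per_cycle:
        wave.append(1)          # precharge: node pulled high
        wave.append(1 - a)      # evaluate: node discharges when A is high
    return wave

def lsd_output_wave(a_per_cycle, z_initial=1):
    """Samples (precharge, evaluate) of the latched LSD inverter output."""
    wave = []
    z = z_initial
    for a in a_per_cycle:
        wave.append(z)          # precharge: latched value is held
        z = 1 - a               # evaluate: output takes the complement of A
        wave.append(z)
    return wave

def transitions(wave):
    """Number of level changes between consecutive samples."""
    return sum(1 for x, y in zip(wave, wave[1:]) if x != y)

if __name__ == "__main__":
    a_sequence = [1, 1, 1, 1, 0, 0, 1, 1]   # one input value per clock cycle
    print(transitions(dynamic_node_wave(a_sequence)))
    print(transitions(lsd_output_wave(a_sequence)))
```

For an input that stays high over several cycles, the dynamic node toggles every cycle while the latched output changes only when the evaluated value changes, which is the qualitative behaviour the waveforms in Figure 4.8 are described as showing.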
4.2 LSD Clocking and feedback Keeper functionality
Clocking LSD circuits is different from clocking static latches or typical dynamic
domino circuits. There is still a synchronizing system clock, but the LSD stages
themselves utilize locally generated pulses which have been triggered from a transition of
the system clock. Figure 4.1 is a diagram that shows how LSD clock pulses relate to the
system clock. Note that the term system clock is analogous here to the global clock
in Figure 4.1; however, in this LSD adder design and implementation the above clock
description is not used, because of IBM's proprietary methodology and confidential
materials.
The keeper prevents inadvertent switching due to charge sharing in non-output-
switching evaluation phases. Because all evaluation tree inputs must be stable, by design,
by the time the clock goes high (thus evaluating the stage), if the tree does not evaluate,
there is no charge transfer occurring in the tree (all internal nodes would have already
been charged or discharged during the precharge phase). If the stage does evaluate,
desired charge transfer occurs. This implies that if the inputs are not supposed to affect a
positive evaluation, any internal nodes that would have otherwise drained charge from
the dynamic node (in a domino circuit) are already charged by the time the clock goes
high; thus, little or no charge transfer takes place in the LSD circuit during a non-
evaluating evaluation phase. The keeper aids in dynamic node charge recovery from
noise events.
It should also be noted that as the evaluation phase for each LSD stage is much
shorter than its associated precharge phase, there exists a lessened probability (compared
to "regular" dynamic circuits) that a noise event will be synchronous with the evaluation
phase to possibly cause a noise related failure. The keeper aids in charge maintenance
during long evaluation phase durations, as might exist during test or other diagnostic or
low frequency operating conditions, whereby current leakage through the evaluation tree
would otherwise drain charge from the dynamic node, resulting in inadvertent output
switching.
4.3 Transistor sizing for Static and Dynamic circuits
At IBM, there are strict constraints in the microprocessor design process and
methodology. In general, very few classes of circuits are allowed: mostly static, with
some exceptions for other circuit styles, including low threshold devices. The
SOI technology available at IBM only allows stacks of up to 4 high for Nfet and 3 high for
Pfet devices. In static CMOS design, the beta ratio of the Nfet and Pfet devices is typically
3, and the Gain for a logic stage is 4, or beta for an inverter stage is 3, beta being defined
as a Gain for a digital circuit such as an inverter, Nand, Nor, etc. The restrictions are
primarily related to yield issues in SOI technology, to limiting unnecessary power
consumption and, most importantly of all, to maintaining consistency throughout the
microprocessor design community. The limitations and guidelines are strictly enforced
via CAD (Computer Aided Design) tools and tool checking available at IBM for any
potential violations in the design.
There are many different ways to start a design from scratch, in this case
starting with a transistor level schematic and finally translating the schematic to layout,
or physical design, after meeting all design specifications. However, it has not always
been easy to convert the transistor level design to a physical design while meeting timing
requirements, because many factors come into play: the design is often complicated and
typically involves hundreds and hundreds of gates and a complex wiring scheme using
different levels of metal layers. As a result, most of the time it is not practical to calculate
transistor sizes and estimate wire loading by hand, not to mention that it may take several
trials and errors to get all timing requirements to converge.
For this reason, the method for static transistor sizing starts with the circuit
driver's load and scales backward depending on the complexity of the circuits.
The driver is often a set of inverters designed to drive a heavy load, high fanout or
long wires. Usually, the beta ratio between stages is limited to between 2.5 and 3 in order
to maintain consistency and not to exceed the power consumption allocated for the
design. Transistor circuits depend on the available technology in terms of
transistor stacking, physical design and wiring tracks. Table 4.1 provides a preliminary
way to estimate the capacitances of the Nfet and Pfet devices and metal layers in order to
derive transistor sizes that meet timing in the early stages of the design.
Design Parameters              Values                      Descriptions/comments
Gate capacitance               0.35 fF/µm                  Minimum geometry, effective average (*see note below)
Source/Drain capacitance       0.19 fF/µm                  SOI has little dependence on diffusion area. Driving into with gate tied off
Metal 1 resistance             0.1210 * L (0.8 µm wide)
Metal 2 resistance             0.1060 * L (0.8 µm wide)
Metal 3, 4 and 5 resistance    0.0820 * L (0.9 µm wide)    L = Length (µm, drawn). Temp = 85°C
Metal 1 capacitance            0.051 fF * L
Metal 2 capacitance            0.048 fF * L
Metal 3 capacitance            0.047 fF * L                L = Length (µm, drawn). Assumes minimum width/min space and fully populated above and below
Table 4.1: IBM CMOS design parameters
Even though the table parameters are somewhat simplified, they provide a highly
effective method of estimating timing and reduce the number of iterations normally
needed to fine tune the designs, thus cutting down the design cycle. To deal with
device stacking for Nfet and Pfet, Table 4.2 summarizes the limits of SOI technology and
the constraints of transistor level design. Table 4.2 sets the limits for Nfet and Pfet
stacking and the scaling factors for different stacking scenarios. The reason for limiting
stacking in SOI design is charge leakage and other yield-related issues. Unlike
bulk CMOS, SOI is highly vulnerable when it comes to stacking Nfet as well as Pfet
devices more than 4 devices high.
Stack height     Effective Width
1 high nFET      W
2 high nFET      W / 1.72
3 high nFET      W / 2.44
4 high nFET      W / 3.16
1 high pFET      W
2 high pFET      W / 1.94
3 high pFET      W / 2.95
Table 4.2: IBM CMOS design guideline for stacked devices
Table 4.2 provides a guideline for stacking Nfet and Pfet devices in situations other
than just sizing an inverter. According to Table 4.2, the actual width of the device, W, is
divided by a scale factor that depends on the type of the device and how high the devices
are stacked. This scaling factor is used to reflect the actual performance of the device:
the higher the devices are stacked, the larger the scale factor becomes, thus
degrading the overall performance of the circuit. As a result, Pfet devices can only be
placed in series 3 high and Nfet devices 4 high. For stacked designs such as Nand or Nor,
Table 4.2 provides scaling factors for sizing Nfet and Pfet devices, since stacked devices
require more special attention due to threshold voltage variation and the floating body,
for SOI technology in particular. For example, for Pfet devices stacked 3 high the width
W is divided by 2.95, and for Nfets stacked 3 high it is divided by 2.44. A stack of 4 Pfet
devices is forbidden at IBM due to the limits of the SOI technology at IBM, charge
leakage and hysteresis of the Pfet device, and the effective resistance of both Nfet and
Pfet devices.
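The guideline in Table 4.2 lends itself to a simple lookup. The sketch below only encodes the scale factors listed in the table; the function and dictionary names are illustrative, and rejecting a 4-high Pfet stack reflects the restriction stated in the text.

```python
# Scale factors from Table 4.2, keyed by (device type, stack height).
STACK_SCALE = {
    ("nfet", 1): 1.00, ("nfet", 2): 1.72, ("nfet", 3): 2.44, ("nfet", 4): 3.16,
    ("pfet", 1): 1.00, ("pfet", 2): 1.94, ("pfet", 3): 2.95,
}

def scale_factor(kind, stack_height):
    """Return the divisor for a stack, or raise if the stack is not allowed."""
    try:
        return STACK_SCALE[(kind, stack_height)]
    except KeyError:
        raise ValueError(f"{stack_height}-high {kind} stack is not allowed") from None

def effective_width(kind, stack_height, drawn_width):
    """Effective width of one device of drawn_width inside a stack."""
    return drawn_width / scale_factor(kind, stack_height)

def drawn_width_for(kind, stack_height, target_effective_width):
    """Drawn width needed so a stacked device matches a single device."""
    return target_effective_width * scale_factor(kind, stack_height)

if __name__ == "__main__":
    # Each Nfet of a 3-high stack must be drawn 2.44 times wider
    # to behave like a single device of width 1.0.
    print(drawn_width_for("nfet", 3, 1.0))
```

Keeping the forbidden combinations out of the dictionary means a sizing script fails loudly instead of silently producing a stack that the design rules would reject.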
To illustrate a simple and relatively effective technique for sizing transistors, a set
of simple equations is given below. These equations use the design parameters provided
in Tables 4.1 and 4.2; together, this technique works rather well for the initial sizing of a
transistor level design. It may take a few iterations to finally have the
design specifications met. At IBM, after sizing the transistors in the design, EinsTLT
is used to verify whether all the timing specifications of the design are met. EinsTLT is
a static timing analysis tool; it analyses the design at the transistor level to ensure there
are no violations in terms of timing, slews and power. Equation (4.1) shows a
simplified expression for finding the size of the Nfet transistor, given that the beta, the
gain of the logic stage and the output capacitance are known.
$\text{fetsize}_N = \dfrac{C_{load}}{H(\beta + 1)}$ (4.1)
where $\beta$ is the ratio of the Nfet and Pfet devices, and $H = C_{load}/C_{in}$ is the Gain of a logic
stage. $C_{load}$ is the total capacitance seen by the driver, which includes the wire capacitance,
$C_{wire}$, and the input gate capacitance of the receiving stage, $C_{in}$, in terms of its width.
The input stage can be any type of static logic gate. The capacitance associated with the
routing layer is called $C_{wire}$, which depends on the estimated length and the metal layer.
The output capacitance for the driver is defined as,
$C_{load} = C_{wire} + \sum_{j=0}^{N} C_{in,j}$ (4.2)
$C_{in}$ is the sum of the total input gate capacitances, in gate/µm equivalence, of a receiving
logic stage in terms of the transistor widths, or $C_{in} = (p + n)$, the Pfet width plus the Nfet
width. N indicates the number of gates. From equation (4.1), the Gain can also be
expressed as a function of the width of the Nfet transistor as seen below
$H = \dfrac{C_{load}}{\text{fetsize}_N(\beta + 1)}$ (4.3)
The delay can be estimated using the equation below from the Logical Effort (LE)
technique [15]. h is the ratio $C_{out}/C_{in}$, and p is the parasitic or intrinsic capacitance
expressed in gate/µm. Therefore, the delay for a logic stage can be estimated as
$d = (hg + p)\tau$ (4.4)
As can be seen in equation (4.4), $h = C_{load}/C_{in}$ is the same as H in equation (4.3);
thus we can substitute H for h in equation (4.4). g is defined as the electrical gate effort
for a certain type of logic gate, which indicates the strength of the output current of the
gate being used in the design. Tau, $\tau$, is the delay associated with a given technology
process for a fan out of 1. FO1 is defined as a minimum size inverter driving an identical
copy, and the FO1 delay for the technology implemented here is approximately
2.5 ps. Substituting equation (4.3) into (4.4) leads to an expression for the stage delay,
$d = \left[ g\,\dfrac{C_{load}}{\text{fetsize}_N(\beta + 1)} + p \right]\tau$ (4.5)
For example, an inverter driver of unknown size needs to be appropriately sized in
order to drive 6 inverters of size 6/3 µm over a distance of 300 µm using minimum
width metal 1. The size of the inverter that drives the 6 inverters can be estimated as
follows,
$C_{in} = (p + n)$
$C_{in} = 6 \times (6 + 3)\,\mu m = 54\,\mu m$ (total of the 6 inverters' input gates)
Since the length of the wire is 300 µm with minimum spacing and the metal layer being
used is metal 1, from Table 4.1 the capacitance per 1 µm is 0.051 fF. The conversion
factor to gate/µm is 0.35 fF/µm. Using the parameters in Table 4.1, the size
of the driver can be calculated assuming $\beta$ is 2 and the gain is 3. $C_{wire}$ is estimated
as,
$C_{wire} = 300\,\mu m \times 0.051\,fF/\mu m = 15.3\,fF$
Converting to gate micron,
$C_{wire} = 15.3\,fF \times \dfrac{1\,\mu m}{0.35\,fF} = 43.7\,\mu m$
$C_{load} = C_{wire} + C_{in} = 43.7\,\mu m + 54\,\mu m \approx 97\,\mu m$
Using equation (4.1), the width of the Nfet can be calculated as,
$\text{fetsize}_N = \dfrac{C_{load}}{H(\beta + 1)} = \dfrac{97\,\mu m}{3(2 + 1)} = 10.8\,\mu m$
and since $\beta$ is 2 the Pfet size is,
$\text{Pfet} = \beta(\text{Nfet}) = 2(10.8\,\mu m) = 21.6\,\mu m$
The electrical effort for the inverter is 1 and, ignoring p, the delay is
$d = \left[ g\,\dfrac{C_{load}}{\text{fetsize}_N(\beta + 1)} + p \right]\tau = \left[ (1)\dfrac{97\,\mu m}{10.8(2 + 1)\,\mu m} + 0 \right](2.5\,ps) = 7.5\,ps$
Using the process described above, one can estimate the sizes of the static gates and the
amount of delay associated with a given logic stage. Calculating the parasitic
capacitance p for Nfet and Pfet devices is relatively involved and does not significantly
improve the accuracy; therefore it can be ignored. To account for the parasitic capacitance
p, we can add a 5% to 15% margin into the design.
Figure 4.9 A Static CMOS inverter driving 6 inverters
The set of equations above can be used to estimate any transistor size. It may
take a few iterations for timing and slews to converge but, most importantly, the method
provides a quick way to estimate the sizes of the transistors in the design without
involving complex equations. Given their simplicity, these equations can be
programmed to quickly do the conversion and calculate all the sizes for more
complicated transistor structures.
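As a minimal sketch of such a program, the functions below follow equations (4.1), (4.2) and (4.5) and use the metal and gate capacitance values from Table 4.1 together with the 2.5 ps FO1 delay. The function names, the dictionary layout and the default arguments are assumptions made for illustration; they are not part of any IBM tool.

```python
GATE_CAP_FF_PER_UM = 0.35                      # gate capacitance, Table 4.1
METAL_CAP_FF_PER_UM = {"M1": 0.051, "M2": 0.048, "M3": 0.047}
TAU_PS = 2.5                                   # FO1 delay quoted in the text

def wire_cap_gate_um(length_um, layer="M1"):
    """Wire capacitance converted to equivalent gate width in um."""
    return length_um * METAL_CAP_FF_PER_UM[layer] / GATE_CAP_FF_PER_UM

def load_gate_um(wire_um, receiver_widths):
    """Equation (4.2): wire load plus the (p + n) width of every receiver."""
    return wire_um + sum(p + n for p, n in receiver_widths)

def size_driver(c_load_um, gain, beta):
    """Equation (4.1): returns (Nfet width, Pfet width) in um."""
    nfet = c_load_um / (gain * (beta + 1))
    return nfet, beta * nfet

def stage_delay_ps(c_load_um, nfet_um, beta, g=1.0, p=0.0, tau_ps=TAU_PS):
    """Equation (4.5): stage delay in ps."""
    return (g * c_load_um / (nfet_um * (beta + 1)) + p) * tau_ps

if __name__ == "__main__":
    # Example from the text: 6 inverters of 6/3 um, 300 um of metal 1.
    c_wire = wire_cap_gate_um(300.0, "M1")
    c_load = load_gate_um(c_wire, [(6.0, 3.0)] * 6)
    nfet, pfet = size_driver(c_load, gain=3.0, beta=2.0)
    print(c_wire, c_load, nfet, pfet, stage_delay_ps(c_load, nfet, beta=2.0))
```

Run on the example above, the script gives a driver Nfet close to the hand-calculated 10.8 µm (slightly larger because the load is not rounded down to 97 µm) and a stage delay of 7.5 ps, since with g = 1 and p = 0 the delay reduces to the gain times the FO1 delay.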
Dynamic logic has become increasingly important in high density VLSI design
because of its compatibility with CMOS fabrication and because it is relatively easy to
integrate into the CMOS design process, owing to its area efficiency and its delay and
power performance characteristics. However, in order for a dynamic logic circuit to
perform optimally, the circuit has to be fine tuned for delay, power and performance. The
procedure used here is to iterate and scale the devices accordingly, starting with the
footed device and tapering the Nfet devices as shown in Figure 4.10.
Figure 4.10: LSD AOI Nfet device sizing
Dynamic circuit sizing is very different from static circuit sizing at the transistor level,
because dynamic circuits involve stacked Nfet devices and a Pfet load device, not to
mention that the Nfet devices have to be charged and discharged periodically. In a
way a dynamic circuit is similar to a Pseudo Nmos circuit; however, the Pfet device has to
be sized adequately in order to hold enough charge during the precharge cycle. If the Pfet
load device is sized too large, then it would take much longer to discharge and the circuit
would not be able to operate at the desired frequency. Also, if the Nfet devices are too
large in terms of width and the Pfet load device is too small, the circuit would not be able
to hold enough charge for a whole clock cycle.
In general, according to [13], the devices in the Nfet chain need to be scaled in a
tapering fashion to minimize the delay. For AND or NOR gates, the scaling of the series
Nfet devices starts with the bottom or footed device at a fixed size W, and each device
above it is made progressively smaller until the last Nfet is reached, as shown in Figure
4.10. The device M1 has a width of 5 and the last Nfet device a width of 2.4. Scaling a
CMOS AOI dynamic circuit is similar to the static CMOS transistor sizing procedure.
Sizing a dynamic circuit also starts with the capacitive load that the circuit is designed
to drive and with the slew and timing criteria or constraints, the same way as for static
CMOS. The circuit starts with minimum-size devices, which are scaled according to the
load being driven. Because of the nature of the Domino logic circuit, the transistors are
sized progressively smaller the closer they are to the dynamic node, to reduce the
capacitance of this node and to make the layout more area efficient. The schematic starts
with some minimum values as described above and is fed into a tool called EinsTuner,
which is used to tune dynamic circuits. This procedure may take a few trials and errors
before the design converges to meet the design specifications. Due to the complexity of
deriving the scaling of dynamic circuits, the equations are not included in this paper,
except for some basic MOSFET equations.
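The tapering described above can be illustrated numerically. The following is a minimal sketch, assuming a geometric taper between the footed device and the device nearest the dynamic node. The function name `tapered_widths`, the geometric rule and the stack height of four are assumptions made for illustration only; they are not the procedure of [13] nor the algorithm of the tuning tool. The end-point widths 5 and 2.4 are the ones quoted for Figure 4.10.

```python
def tapered_widths(w_bottom, w_top, n):
    """Return n widths shrinking geometrically from w_bottom (foot) to w_top."""
    if n < 2:
        raise ValueError("need at least two stacked devices")
    # Constant ratio between neighbouring devices (assumed taper rule).
    ratio = (w_top / w_bottom) ** (1.0 / (n - 1))
    return [w_bottom * ratio ** k for k in range(n)]


# Stack height 4 is an assumed value; 5 and 2.4 are the quoted end widths.
widths = tapered_widths(5.0, 2.4, 4)
for i, w in enumerate(widths, start=1):
    print(f"M{i}: width {w:.2f}")
```

In practice the intermediate widths would come out of the iterative tuning run rather than from a fixed ratio; the sketch only shows the monotonically decreasing shape of the stack.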
Chapter V
Adder Designs and Implementations
This chapter describes the procedures used to implement the adders with IBM's and
Cadence's VLSI design tools. Section 1 presents the designs of the two adders, static and
LSD, in terms of implementing the architecture based on the Boolean expressions derived
in the previous chapters; the essential equations are included for reference, and the
number of gates needed to implement the static design is taken into account. Section 2
describes the physical design aspects of the adders and the simplified physical design
flow at IBM using internal and vendor VLSI design tools. Also included, in the table
provided, are the design parameters: device parameters for the gate, and the metal layers
with their thickness, pitch and width and their availability for routing. These parameters
are extremely useful in the early stage of the design, since with these models the delay
and timing of the circuits and macros can be estimated. The designs are iterated until
the macros meet all design criteria.
5.1 LSD 64-bit and Static CLA adder designs
The implementation of the adder is based on the second-level abstraction architecture
described in chapter 3, which uses the expressions below.
$P_{3:0} = p_3 p_2 p_1 p_0$    (5.1)
$g_{3:0} = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$    (5.2)
(5.3)
(5.4)
Expressions (5.1)-(5.2) form a block which generates the super propagate and generate
signals, called $P_{3:0}$ and $g_{3:0}$, or $P_i$ and $G_i$. Equation (5.3) is
implemented as a dedicated carry block whose inputs are the propagate and generate
signals, $p_i$ and $g_i$, from the individual bits $x_i$ and $y_i$.
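As an illustration of the 4-bit group propagate and generate signals, the sketch below evaluates them for every pair of 4-bit operands and checks that the group carry-out agrees with ordinary addition. It assumes the per-bit propagate is the XOR of the operand bits and the per-bit generate is their AND; the function names are hypothetical and the code is a behavioural model only, not the gate-level implementation used in the adders.

```python
def bit_pg(x, y):
    """Per-bit propagate (assumed XOR) and generate (AND) of bits x and y."""
    return x ^ y, x & y


def group_pg(p, g):
    """4-bit group propagate and generate; index 0 is the least significant bit."""
    group_p = p[3] & p[2] & p[1] & p[0]
    group_g = (g[3]
               | (p[3] & g[2])
               | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    return group_p, group_g


def carry_out(a, b, c_in):
    """Carry out of a 4-bit group, computed from the group signals."""
    pairs = [bit_pg((a >> i) & 1, (b >> i) & 1) for i in range(4)]
    p = [pair[0] for pair in pairs]
    g = [pair[1] for pair in pairs]
    group_p, group_g = group_pg(p, g)
    return group_g | (group_p & c_in)


# Exhaustive check against integer addition.
mismatches = sum(
    carry_out(a, b, c) != ((a + b + c) >> 4)
    for a in range(16) for b in range(16) for c in (0, 1)
)
print("mismatches:", mismatches)
```

The same group signals can be combined again at the next level, which is how a 16-bit or 64-bit look-ahead structure is built out of 4-bit blocks.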
The 64-bit static and LSD CMOS adders are designed using the Boolean
expressions described above. Implementing the static CLA is relatively straightforward:
the static adder is implemented directly from the equations described in chapters 3 and 5.
These equations are implemented using static CMOS standard library gates, which
consist primarily of NAND, NOR, XNOR and INVERTER gates; the final design
therefore has to be optimized in order to reduce delay and power consumption.
The LSD adder, however, is more difficult, since the equations are expressed at the
gate level while LSD is by nature a dynamic logic style that only evaluates its inputs
monotonically. An expression that involves an XOR therefore requires both polarities of
its inputs. The LSD stage requires a non-inverted output in order to accommodate this
polarity difference, and this additional output can be easily implemented in the LSD
logic style.
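The need for both polarities can be seen in a small behavioural sketch. XOR is not a monotonic function of its inputs, but it can be built from AND and OR alone once each input is available as a true rail and a complement rail. The rail naming (`_t`, `_c`) and the function name are assumptions made for this illustration; the code does not model the LSD circuit itself.

```python
def xor_dual_rail(a_t, a_c, b_t, b_c):
    """XOR and XNOR from true (_t) and complement (_c) rails, AND/OR only."""
    xor_out = (a_t & b_c) | (a_c & b_t)
    xnor_out = (a_t & b_t) | (a_c & b_c)
    return xor_out, xnor_out


for a in (0, 1):
    for b in (0, 1):
        x, xn = xor_dual_rail(a, 1 - a, b, 1 - b)
        print(f"a={a} b={b} xor={x} xnor={xn}")
```

Because only AND and OR appear, each output is monotonic in the four rails, which is the property a dynamic evaluation tree needs.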
5.2 Physical design tools and CLA adder layouts
The physical design is an important part of the IC design process. Many
iterations have to be done in order to finalize the design, since the schematic
capture most often does not include all the details and all the intrinsic and
parasitic capacitances of the devices. On top of that, the lumped RC delay model is
relatively accurate over a short distance but notoriously bad for a long interconnect
wire. Usually, the lumped and PI-RC models are only good for a certain optimal length of
wire. Besides, accurately modeling and reflecting all the details and delays in the
schematic capture quite often takes much more time, to say the least. As a result, it
often requires several iterations to get the design working properly.
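The limitation of the lumped model can be sketched with the Elmore delay of a wire split into equal RC segments. With one segment the estimate is R·C; as the number of segments grows it tends toward R·C/2, and because both R and C grow with length, the absolute gap between the two estimates grows with the square of the wire length. The per-micron resistance and capacitance below are assumed illustrative values, not parameters of the technology used in this work.

```python
def elmore_delay(r_total, c_total, segments):
    """Elmore delay of a wire modeled as `segments` equal RC sections."""
    r_seg = r_total / segments
    c_seg = c_total / segments
    # Each capacitor sees all the resistance between it and the driver.
    return sum((k * r_seg) * c_seg for k in range(1, segments + 1))


R_PER_UM = 0.1       # ohm per micron, assumed illustrative value
C_PER_UM = 0.2e-15   # farad per micron, assumed illustrative value

for length_um in (100, 1000, 5000):
    r = R_PER_UM * length_um
    c = C_PER_UM * length_um
    lumped = elmore_delay(r, c, 1)
    distributed = elmore_delay(r, c, 100)
    print(f"{length_um:5d} um: lumped {lumped * 1e12:8.3f} ps, "
          f"100-segment {distributed * 1e12:8.3f} ps")
```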
Figure 5.1: Flowchart for schematic transistor design (boxes: schematics and transistor sizing; simulation and static timing; physical design layouts)
Figure 5.1 shows a simplified chart of the schematic transistor design process at IBM,
which typically requires many iterations depending on the initially estimated sizes of the
transistors. The initial sizes of the transistors are usually estimated and calculated using
the techniques described in chapter 4. The calculations take into account all capacitive
loading, including the wiring tracks and layers that the transistors are intended to drive.
The wiring can be on metal 1 to metal 3, and the track is the width of the wire,
which can be minimum, double or triple wide. After the transistors are approximately
sized, the schematic netlist is extracted, simulated and finally taken through the static
timing analysis tool called EinsTLT. Every design has specifications, and EinsTLT is
used to check whether the design meets them; this is the pre-layout schematic design
phase. If the schematic design meets all the desired criteria, the design is translated to
a physical layout of all the transistors using Cadence's Virtuoso XL physical design tool.
Virtuoso is used to place and route the design, and all the necessary checks such as
DRC, LVS, and METH are performed
using this tool. The final step is to extract the layout so that the design can once again
be simulated and a static timing check performed, to ensure that the design meets all the
specifications; the whole process is repeated if the design does not meet the desired
specifications.
Figure 5.2 depicts a flow chart that describes the physical design process at IBM
using Cadence's computer aided design tool, Virtuoso. Virtuoso XL is a custom
place and route computer aided design tool available in the IBM physical design
environment, and it helps to speed up the physical design process. Virtuoso XL has proven
extremely helpful in translating schematics to physical designs, even though it may take
several iterations to get the design to converge in terms of timing. Conventionally,
converting a schematic circuit to a physical design (layout) is a laborious manual process.
However, as designs become more and more complex, with thousands and thousands of
transistors being put in a smaller and smaller area, this manual design is no longer
feasible; this applies not only to the design itself but also to the case where the design
needs to be modified to include new logic or incorporate new features. Even though this
process seems similar to synthesis, it is not a synthesis process, because the designer
still has full control over the design, unlike in automatic synthesis. Using Virtuoso XL,
the designer has the ability to optimize the placement and routing of the design.
Figure 5.2: CMOS physical design methodology flow (steps: open schematic and create new layouts; create instances and pins with "Generate from source"; manual placement or Virtuoso custom placement; manual routing of signal connections or Virtuoso custom router; DRC and LVS)
The end product is not as good and optimal as a full custom design, but the trade-off is
the time saved, and this saving has proven very critical, since in general the design
cycle is getting shorter and shorter in the VLSI business. Custom design, in practical
terms, means control over the circuit styles, topology, device sizes and the physical
design of both transistors and interconnects. By manually designing the macro, the circuit
can be optimized by minimizing the parasitic capacitances and the number of stages (logic
gate levels) in order to minimize delay compared with ASIC or cell-based, synthesized
design.
5.3 IBM's CMOS8S0 SOI technology
The new technology currently available at IBM is SOI (Silicon On Insulator). It is
partially depleted as opposed to fully depleted, since the partially depleted process is
much more stable and more closely comparable to the conventional CMOS process, with a
channel length of 0.08 and a gate oxide thickness of 10 Angstrom (Å). The CMOS8S0 FETs
are Partially Depleted (PD) SOI devices. In a PD-SOI FET, the body of the device is not
fully depleted of charge carriers at a gate bias equal to the threshold voltage.
Gate L_eff      0.09 µm
Gate oxide      2.3 nm
Metal layers    width      thickness
M1              0.5 µm     0.3 µm
M2              0.63 µm    0.3 µm
M3-M5           0.63 µm    0.42 µm
M6 (MQ)         1.26 µm    0.92 µm
M7 (ML)         1.26 µm    0.92 µm
Dielectric      ε_r ~ 4.2
V_dd            1.6 V
Table 5.1: The IBM CMOS8S0 features
This type of design is considered to have manufacturability advantages, since the
low-drain-bias Vt does not depend on the thickness of the device silicon layer or on the
amount of oxide charge in the buried oxide (BOX). Since a quasi-neutral region exists
when the device is on,
charge accumulation can occur in the electrically isolated body, which leads to some
unique effects. The Nfet and Pfet extension, halo and source/drain implants have been
optimized to reduce the floating body effects inherent in SOI FETs. The magnitude of
the floating body effect has been reduced by increasing the junction leakage significantly.
The major difference between Silicon On Insulator (SOI) and bulk silicon is that SOI
devices are fully oxide isolated. Individual device regions defined by the RX mask level
are isolated laterally by the shallow trench oxide and from the substrate below by the
buried oxide (BOX). This improved isolation results in a number of features and effects
which are unique to SOI technology. N+/P+ and well spacing constraints due to latch-up
are relaxed because of the improved isolation. The improved circuit performance provided
by SOI technology is partially due to a reduction in junction capacitance.
The technology used in this project to implement the adders is the 0.18µm
CMOS8S0 SOI technology at IBM, as shown in Table 5.1. This is the fifth
generation of SOI technology used to fabricate state of the art microprocessors at
IBM. It features seven levels of copper wiring, plus a special, slightly doped level
which is used for connections within a standard cell besides the metal 1 layer. This
special level provides an extra layer and has proven extremely useful and area saving
compared to other processes: since standard and memory cells are used in many places
and repeated many times in the chip or microprocessor, it is imperative to make these
cells as small as possible in order to reduce power consumption and make the design as
compact and optimal as possible, thus eliminating the need to use metal 2 to connect
signals in a complex cell such as XOR/XNOR. Table 5.1 summarizes some basic parameters
of the CMOS8S0 SOI technology, such as the effective gate length, the gate oxide
thickness, and the width and thickness of the metal routing levels.
5.4 Design checking in physical design
Design checking at the transistor level provides the designers with a degree of
freedom not allowed in a cell-based methodology. However, this comes at a price:
expanding the design space to include additional circuit topologies and families demands
additional rigor in checking, so that improperly designed circuits do not find their way
into manufacturing. The definition of "best design practices" is subjective and depends on
many factors, such as the class of circuits. In general, it is a set of guidelines that
limits the range of possible circuits by separating legal from illegal ones. Einscheck
provides a means for discriminating legal and illegal circuit constructs and device sizes,
and performs basic checks on electrical relationships such as noise margin, capacitive
coupling and interconnect current carrying capacity. The checking strategy was
instrumental in ensuring that the final chip could be assembled with a minimum of
problems. Because of the size of the chip, a large number of problems at the end of the
design cycle would be too difficult to detect and fix. Therefore a checking methodology
was developed to treat each unit and the core as a "chiplet". In addition, a robust set of
methodology checks was developed to ensure that all macros, units, and the core could be
correctly integrated at the next level of hierarchy.
In order to check an entity as a chiplet, it is necessary to understand the
environment in which the chiplet resides in the chip. To model this environment, the
cover (routing contract with the parent) and the parent cover (fixed chip level
infrastructure) were added to the unit for DRC and LVS verification. Since the cover
contains blockage layers rather than manufacturable shapes on the same layers, a separate
set of checks was included. No minimum area checks were done on blockage shapes, since
they are not required to comply with the area rules. A separate methodology DRC deck was
created to check additional design constraints above and beyond the design rules
necessary for manufacturing. These checks concentrated on ensuring the quality of the
blocks as well as their ability to be integrated at the next level of hierarchy. The types
of design constraints checked included ensuring that all manufacturable shapes are
"one-half ground rule" for the boundary of the floor-plan block, power buses are on the
correct periodicity, and clock pins are in the correct
tracks. Additional checks maintained the quality of the design for routing; for example,
pins are checked to ensure that they are on grid and accessible from the same layer or the
layer above.
The final step in physical checking is to extract the design to get a representative
netlist that includes the resistance and capacitive coupling of each net; in some cases the
nets are also coupled with noise, depending on the metal layer that the nets are connected
to. With this extracted netlist, all the checks are performed again, including static
timing analysis and simulation, to finally verify whether all the design specifications
are met.
Chapter VI
Adder Evaluations
This chapter is the overall analysis of the performance of the two adders, which are
constructed with two different circuit styles and topologies: one is conventional static
and the other is the LSD logic style. The adders are evaluated from the points of view of
power consumption, performance, area and ease of technology mapping. For the power
consumption, each adder block is simulated using PowerSpice, which is similar to the
regular Spice circuit simulator, to extract the power consumption; all of the blocks are
then put together and the power consumption is estimated using the equations described in
the previous sections as well as the model used for power simulation. As for performance,
the adders are also evaluated using PowerSpice for clock rate, the capacitive load they
are designed to drive, and various other aspects of the designs.
A conventional dynamic or domino circuit style consists of precharge
and evaluate cycles. The clocking scheme typically employs a single phase or multiple
phases, depending on the application. In a dynamic topology, the circuit can only
implement an inverting function, that is, the signal can only transition from high to low,
whereas a domino circuit is applicable for implementing a non-inverting function, whose
output can only change from low to high. Since neither of these circuit styles, dynamic or
domino, contains any latch or storage element, the precharge and evaluate cycles
continuously charge and discharge the circuit, consuming extra power to refresh the value
even when the output value doesn't change.
6.1 Static adder performance evaluations
Static Circuit Power Consumption: Power consumption appears to be primarily related
to clock switching at the latches. The total energy consumed in the first seven cycles
(0.394ns to 5.997ns) is about 6.683mWns, which includes all of the functional switching
for the simulation. The total energy consumed during the next seven cycles (5.998ns to
11.600ns), during which no functional switching takes place (but the clock is still
switching), is 4.389mWns. The total energy consumption for a cycle in which clocking is
the only switching going on was 0.627mWns. The total energy consumption for the first
half of this simulation (0 to 12.5ns), which includes all of the simulation's functional
switching and some time of no functional switching, was 11.972mWns. The total energy
consumption for this simulation (25ns) is 21.731mWns.
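The per-cycle figure quoted above follows directly from the seven-cycle totals, as the short calculation below shows. The split into "functional" and "clock-only" energy assumes that the clock-related energy is the same in active and idle cycles; that assumption is made for this illustration and is not stated in the simulation results.

```python
CYCLES = 7
e_active = 6.683   # mW*ns, first seven cycles (functional + clock switching)
e_idle = 4.389     # mW*ns, next seven cycles (clock switching only)

e_clock_per_cycle = e_idle / CYCLES
# Assumes the clock energy is unchanged during the active cycles.
e_functional = e_active - e_idle
clock_share = e_idle / e_active

print(f"clock-only energy per cycle: {e_clock_per_cycle:.3f} mW*ns")
print(f"functional switching energy over seven cycles: {e_functional:.3f} mW*ns")
print(f"clock share of the active-window energy: {clock_share:.1%}")
```

Under that assumption roughly two thirds of the energy in the active window is spent on clocking, which is consistent with the observation that power is dominated by clock switching at the latches.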
6.2 LSD adder performance evaluations
SOI technology has been touted over the years as a great process for reduced
power consumption, due to the reduction of the total capacitance on the device as well as
on the chip. Equation 6.1 shows the relation between power (P), capacitance (C),
voltage (V), frequency (f) and the switching activity ($\alpha$).

$P = \alpha C V^2 f$    (6.1)

The power consumption savings come primarily via two opportunities. First, the total
capacitance is diminished due to the reduction of the diffusion capacitance. Since the
junction capacitance is not the only capacitance on the chip, this results in a power
saving of approximately 10% to 15% compared to a bulk process. Second, with the
performance advantages possible with SOI technology, it is possible to operate the
circuits at the same frequency with a lower voltage; since the power consumption depends
non-linearly on Vdd, this results in a much larger power saving compared to bulk CMOS.
Comparing power consumption between static and LSD circuits is also somewhat difficult,
because common patterns don't exercise the worst-case conditions of the two circuit
families, and there exist different means by which power is consumed in the two types of
circuits.
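The effect of supply voltage on switching power can be sketched with the standard relation P = αCV²f, which is assumed here as the form of the power equation. The 1.6 V supply is the value listed in Table 5.1; the reduced supply of 1.4 V, the switching activity, the capacitance and the frequency are assumed values chosen only to show the quadratic dependence.

```python
def dynamic_power(alpha, c, vdd, f):
    """Switching power P = alpha * C * Vdd^2 * f (assumed standard form)."""
    return alpha * c * vdd ** 2 * f


# alpha, c and f are assumed illustrative values; they cancel in the ratio.
base = dynamic_power(alpha=0.2, c=1.0e-12, vdd=1.6, f=2.5e9)
low_v = dynamic_power(alpha=0.2, c=1.0e-12, vdd=1.4, f=2.5e9)

print(f"relative power at 1.4 V versus 1.6 V: {low_v / base:.3f}")
```

Since the other factors cancel, the ratio depends only on the square of the voltage ratio, which is why a modest supply reduction at constant frequency gives a larger saving than the capacitance reduction alone.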
LSD Circuit Power Consumption: There was a comparable amount of functional
switching going on during this simulation as compared to that for the static adder,
although it was carried out over a longer period of time. The total energy consumption
for the first half of this simulation (0 to 12.5ns), which includes all of the simulation's
functional switching and some time of no functional switching, was 6.002mW. The total
energy consumption for this simulation (25ns) is 17.113mW. The fact that the LSD circuits
do consume dynamic power under certain data input conditions seems to be offset by the
fact that less clock-controlled device width is required than in the static equivalent.
As far as noise is concerned, several experiments were performed to study the
effects of unwanted signal coupling onto LSD input signals, and the overall impression is
that a very significant and unusual condition must exist for noise coupling to cause a
functional problem. All the noise simulations were performed using the "Noise"
simulation corner provided in the ASX environment, which is believed to force the
worst-case noise condition. Other conditions may exist where coupling onto the dynamic
node or some other node local to the LSD stage may occur, but typically vigilant physical
design methods should eliminate these extreme cases. Keeping the dynamic node connection
no longer than 500um significantly reduces the chance of the LSD circuit switching
falsely.
Generally speaking, power consumption in static circuits is dominated by
transistor gate switching, internal gate and diffusion node switching during functional
evaluation, and the switching of all the interconnecting wire capacitance. (For this
discussion, power consumption due to leakage will not be considered.) In the LSD circuit
family, in addition to all of these intrinsic means of power consumption listed for static
circuits,
there also exists power consumption related to the dynamic clocking of the circuits.
Specifically, depending on the inputs to the evaluation tree, the "dynamic node" will
recharge (from gnd to vdd) after each cycle that the evaluation tree fully dissipates the
dynamic node charge to gnd. Also, nodes internal to the evaluation tree that were part of
a complete discharge path from the dynamic node to gnd can charge and discharge with
or without the stage's output switching. These two types of power consumption have
been referred to as secondary or implied clocking power. They are solely data dependent,
and under worst-case conditions, can account for a significant portion of the total power
consumed in a design. If the inputs are held low cycle after cycle, there will exist a
constant path of conduction from the dynamic node to the node just above the evaluation
tree's foot device. Even though that stage's output will not switch from cycle to cycle, the
dynamic node and all the internal nodes along the path of the input nfets will charge
during each precharge phase and discharge during each subsequent evaluation phase,
cycle after cycle (again, with no functional output switching even happening).
Depending on the worst-case secondary clocking power consumed in the logic, it
seems unlikely that the overall power consumption of an LSD design would surpass that
of a static equivalent. However, as the LSD design will likely be smaller, power density
problems could arise. This could be mitigated by introducing decoupling capacitors in
the design, but this would offset the area benefit to some degree.
Table 6.1 summarizes the results for the two adders. The physical layout area of the
LSD adder is approximately one fourth the size of the static one; this is due to the size
of the LSD circuits and the reduced use of Pfet devices, since a dynamic circuit consists
mostly of Nfet devices, which are relatively small compared to those of a static CMOS
circuit. The power consumption of the LSD adder is roughly 20% less than that of the
static adder, but this power reduction does not reflect the fact that the LSD adder is
one-fourth the size of the static adder; this is because of the constant switching nature
of dynamic circuits. Both of these adders are operated at a ~200ps cycle and the adders'
outputs are measured using EinsTLT, which is an internal transistor-level static timing
analysis tool at IBM.
Type of adder   Number of gates   Dimension (µm)   Area (µm^2)   Power (mW)   Output valid (ps)   Cycle (ps)
Static CMOS     476               330x675          222000        626          123                 ~400
LSD CMOS        n/a               90x190           51000         492          90                  ~400
Table 6.1: Comparison between Static and LSD adders
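The ratios discussed in the text can be recomputed from the area and power columns of Table 6.1. The sketch below is only arithmetic on those tabulated values; the dictionary layout and variable names are assumptions made for the illustration. It also derives the power per unit area, which makes the power density concern raised earlier concrete.

```python
adders = {
    "Static CMOS": {"area_um2": 222000, "power_mw": 626},
    "LSD CMOS": {"area_um2": 51000, "power_mw": 492},
}

static = adders["Static CMOS"]
lsd = adders["LSD CMOS"]

area_ratio = lsd["area_um2"] / static["area_um2"]
power_saving = 1 - lsd["power_mw"] / static["power_mw"]
density = {name: a["power_mw"] / a["area_um2"] for name, a in adders.items()}
density_ratio = density["LSD CMOS"] / density["Static CMOS"]

print(f"LSD area / static area: {area_ratio:.2f}")
print(f"LSD power saving: {power_saving:.1%}")
print(f"LSD power density / static power density: {density_ratio:.2f}")
```

With these numbers the LSD adder occupies a little under a quarter of the static area while using about a fifth less power, so its power per unit area is several times higher than that of the static adder.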
Included are the layouts of both adders, with the routing metals turned off for easy
viewing; the hierarchies of both adders are maintained between the schematic and
physical layouts.
64-bit Static CMOS CLA Area 210000, 400x500
64-bit LSD CMOS CLA Area 51000, 90x190
Chapter VII
Conclusions and future works
In an attempt to solve some of the problems of the conventional dynamic and domino
logic styles, this thesis presented a new dynamic logic circuit style called LSD
(Limited Switching Dynamic), so named for its latching capability, and also included the
design and implementation of the CLA (Carry Look Ahead) adder. In the process of
designing and implementing the two adders, various internal tools at IBM were used. At the
very least, the preprocessing of design data for the Verity, EinsTLT, and Echk tools via
SPAM (IBM's Subcircuit Pattern Matching tool) would have to be taught a variety of things
(e.g., electrical behavior) with regard to LSD. Initial attempts at running Verity on LSD
circuits failed, apparently due to Verity's inadequate comprehension of the netlist
topology.
The two adder designs in this experiment performed relatively well; however, the
LSD circuits are much more sensitive to wire loading, clock skew and data setup time,
so all the dynamic stages have to be designed carefully in terms of device sizing and the
drivers of the output stages. Also, the IBM CMOS8S0 technology manual states that, in the
interest of accurate model-to-hardware correlation, FET device widths should not be
smaller than 2µm and long wires should be avoided as direct gate inputs. As far as the
physical design of the adder is concerned, it is much more challenging to lay out and wire
the LSD adder than the static adder, because of the differential inputs and outputs
required for some of the LSD stages and because the LSD circuit is more sensitive to
capacitive loading and to data and clock skew than the static circuit. For these reasons,
designing with LSD circuits takes a lot more effort, time and careful planning, and, most
important of all, the design has to be rigorously checked and tested if it is to be
fabricated in real silicon.
Since no end to the growth of VLSI is in sight, and microprocessors will only get
more and more complex, forever packing in more features, transistors, power consumption
and speed, it only makes sense to include more and more dynamic and other logic circuit
styles in order to alleviate some
of the problems and speed up the critical paths, especially data oriented paths in
heavily number crunching applications. Designing VLSI using CMOS SOI technology is
somewhat different from bulk CMOS, because SOI is a relative newcomer in the VLSI arena
and the process is not completely understood. On top of that, it is even more difficult to
get good yields and performance from dynamic circuits fabricated on CMOS SOI, because the
dynamic circuit model in SOI does not correlate well with the hardware. However, as SOI
technology becomes more mature and the modeling better understood, there is no doubt that
dynamic circuits will play an important part. Despite some of the problems that dynamic
circuits possess, they are still very much included in IBM's microprocessor design
community, where the dynamic circuit serves as the last resort solution to remedy some of
the area, power consumption and speed problems. Presently, the dynamic circuit has gained
more acceptance, and that is not expected to change anytime in the near future.
Appendix A
Static schematics and layouts
This appendix includes the basic schematics of the static adder. Since the
implementation of the static adder is directly based on the Boolean expressions, it is
relatively straightforward to understand the functionality of all the blocks of the CLA.
Top level schematics of the 64-bit CLA and the reduced full adder are included for
reference. An equivalent physical layout of the adder is provided, with all the metal
levels turned off.
Schematic of Static 64-bit CLA
Schematic of Static 4-bit RA (Reduced Adder)
Layout of Static 64-bit CLA
Appendix B
LSD schematics and layouts
Schematic of LSD 64-bit CLA
Physical layout of LSD 64-bit CLA
Schematic of LSD 16-bit CLA
Schematic of LSD 4-bit Gen&Prop and sum
Schematic of LSD 4-bit Gen&Prop, Carry Look Ahead
Schematic of LSD 2-bit Super Gen&Prop
Schematic of LSD Super Carry Gen Blk
Appendix C
SOI Partially-Depleted Simplified Drain Equations
The following MOSFET equations are extracted from the BSIMPD model; they show a
simplified drain current equation and its parameters. Unlike in bulk CMOS, the drain
current is much more complex and depends on many factors due to the nature of SOI
technology.
$$I_{ds,MOSFET} = \frac{I_{ds0}}{1 + \dfrac{R_{ds} I_{ds0}}{V_{dseff}}} \left(1 + \frac{V_{ds} - V_{dseff}}{V_A}\right)$$

$$I_{ds0} = \frac{\beta V_{gsteff} \left(1 - A_{bulk} \dfrac{V_{dseff}}{2 (V_{gsteff} + 2 v_t)}\right) V_{dseff}}{1 + \dfrac{V_{dseff}}{E_{sat} L_{eff}}}$$

$V_{th}$: the threshold voltage
$A_{bulk}$: bulk charge factor
$V_{dseff}$: the effective drain voltage
$V_{gsteff}$: the effective gate over-drive voltage
$R_{ds}$: source and drain series resistance
$\mu_{eff}$: the effective electron mobility
$E_{sat}$: the critical electrical field where carrier velocity becomes saturated
$V_A$: the Early voltage, which accounts for channel length modulation
Appendix D
Static and Dynamic Power Calculations
Included below is the PowerCal panel, which is used at IBM to estimate the power
consumption of either the schematic or the physical layout of a macro or block. The power
estimation algorithm is based on the formulas described in chapter II, section 2.3. In
reality, the algorithms used in the power estimation tool are much more complex than the
expressions given in chapter II.
[PowerCal panel. Fields: interactive or batch mode; library name; power info handling mode (calculate, query, add, update); cell name and cell view; static, dynamic, large array, small array and CPL area percentages, each with a switching factor (%) and a power density (W/mm^2); clock frequency (GHz); resulting power density (W/mm^2) and power (W); batch input and output files.]
References
[1] T. Haniotakis, Y. Tsiatouhas, "Novel Domino Logic Designs," IEEE Journal of
Solid-State Circuits, vol. 32, no. 2, pp. 213-215, 1999
[2] J. Yuan, C. Svensson, "High-Speed CMOS Techniques," IEEE Journal of Solid-State
Circuits, vol. 24, no. 1, pp. 2-70, 1989
[3] K. Eshraghian, "Principles of CMOS VLSI Design: A Systems Perspective,"
Addison-Wesley Publishing Company, 1988
[4] S. Hwang, A. Fisher, "Ultra 32-bit CMOS adder in Multiple-Outputs Domino Logic," IEEE
Journal of Solid-State Circuits, vol. 24, no. 2, pp. 358-368, 1989
[5] V. Friedman and S. Liu, "Dynamic CMOS Logic Circuits," IEEE JSSC, vol. SC-19, no. 2,
pp. 263-266, April 1984
[6] C. R. Tretz, W. Reohr, "Ratio CMOS: Low Power High-Speed Design Choice in SOI
Technologies," IEEE SOI Conference, pp. 28-29, 2000
[7] Z. Wang, G. A. Julien, "Fast Adders Using Enhanced Multiple-output Logic," IEEE
Journal of Solid-State Circuits, vol. 32, no. 2, pp. 206-214, 1997
[8] R. H. Krambeck, "High-Speed Compact Circuits with CMOS," IEEE Journal of Solid-State
Circuits, vol. SC-17, no. 3, pp. 614-619, 1982
[9] A. Pua and L. Welch, "Issues in the Design in Domino Logic Circuits," IEEE Journal of
Solid-State Circuits, vol. 32, no. 2, pp. 213-215, 1999
[10] V. Fried and S. Liu, "Dynamic Logic CMOS Circuits," IEEE Journal of Solid-State
Circuits, vol. SC-19, no. 2, pp. 263-266, 1984
[11] P. Larsson, C. Svensson, "Noise in Digital Dynamic Circuits," IEEE Journal of
Solid-State Circuits, vol. 29, no. 6, pp. 655-662, 1994
[12] J. M. Rabaey, "Digital Integrated Circuits," Englewood Cliffs, NJ: Prentice-Hall, 1996
[13] L. T. Wurtz, "An Efficient Scaling Procedure For Domino CMOS Logic," IEEE Journal
of Solid-State Circuits, vol. 2, no. 22, pp. 580-585, 1992
[14] W. C. Miller, G. A. Julien, "Area-Time Analysis of CLA Using Enhanced Multiple Output
Domino Logic," IEEE Journal of Solid-State Circuits, vol. 32, no. 22, pp. 59-65, 1992
[15] X. Yu, V. Oklobdzija, "Application of Logical Effort on design of Arithmetic Blocks,"
Signal, System and Computer, vol. 1, pp. 872-874, Nov. 2001
[16] D. L. Stasiak, S. N. Storino, "440ps 64-bit Adder in 1.5V/0.18-um Partially Depleted
SOI Technology," IEEE Journal of Solid-State Circuits, vol. 36, no. 10, pp. 1546-1552, 2001
65
Vitae
Cong Nguyen was born on April 20, 1962, in Viet Nam. He immigrated to the US in 1986
and finished his undergraduate study in Columbus, Ohio. After graduation, he worked for
Lucent Technologies Inc., which was spun off from AT&T, in the area of VLSI, developing
and modeling custom high-speed library cells for read-channel computer product
development. After a few years, he moved to AAnet.com, a network product division of
PMC-Sierra Inc. located in Allentown, PA, where he worked on the SERDES (Serializer
and Deserializer) chip and analog circuit designs.
Most recently, he has been working at IBM in Poughkeepsie and East Fishkill, New York.
The projects he has been involved in include the PowerPC for eServer, mainframe servers,
and Intel low-end series server products. Currently, he is moving to IBM's EDA (Electronic
Design Automation) and Technology division to develop hardware models and
Electronic Design Automation software tools for the server and ASIC design
community.
