Modeling and optimization of CMOS logic circuits with application to asynchronous design by Shams, Maitham
Modeling and Optimization of 
CMOS Logic Circuits 
with Application to  Asynchronous Design 
Maitham Shams 
A thesis
presented to the University of Waterloo
in fulfilment of the
thesis requirement for the degree of
Doctor of Philosophy
in
Electrical Engineering 
Waterloo, Ontario, Canada, 1999 
© Maitham Shams 1999
The University of Waterloo requires the signatures of all persons using or photocopying 
this thesis. Please sign below, and give address and date. 
Abstract 
CMOS remains the mainstream IC technology for the foreseeable future. This thesis addresses
modeling and optimization of conventional and differential CMOS logic styles, provides
insightful analysis, and derives formulas for optimal transistor sizing of mixed logic-style
CMOS circuits. Furthermore, as an application platform, the thesis deals with the
less developed area of asynchronous circuits, rather than the commonly used synchronous
circuits.
The scope of the modeling and optimization technique presented in this work covers the
device, switch, logic, and module levels of abstraction. At the device level, we propose a
simple model for evaluating the saturation current of submicron MOS devices. This model
reproduces the short-channel characteristics of modern MOS transistors accurately. At the
switch level, we recognize and model four types of delays: PMOS rising delay, NMOS falling
delay, NMOS rising delay, and PMOS falling delay. The delay models at this level capture
the effect of input signal slope and characterize the behaviour of MOS transistors connected
in series. At the logic level, we apply the switch-level delay models to formulate delay
macromodels for different CMOS logic styles including conventional, DCVSL, and CPL. We
also derive closed-form optimal transistor sizing formulas for several popular CMOS logic
styles. At the module level, using the optimal transistor sizing formulas, we demonstrate
that it is feasible to optimize the delay of a circuit involving mixed CMOS logic styles.
Part of this work is devoted to comparing different CMOS implementations of logic gates.
We develop a fair method for this purpose and study the performance and energy consumption
of various conventional and differential CMOS implementations of the C-element and
XOR gate, which are the most widely used primitives in asynchronous control circuits. For
each primitive, we express our recommendation regarding the most appropriate implementation.
We also introduce a differential logic style that has a static memory and, hence, is
suitable for implementing primitives such as the C-element.
Finally, a theory of delay optimization evolves from our work, which states that the delay in a
circuit consisting of conventional CMOS logic gates is minimal if, for each stage along the
critical path of the circuit, the delay due to that stage as a load equals the delay through that
stage as a drive.
Acknowledgments
Whoever doesn't thank others, hasn't indeed thanked God.
Apostle of God, Mohammad (S)
This thesis would not have been possible without the generous support, patient guidance,
and constructive criticism of my supervisors Dr. Mohamed Elmasry and Dr. Jo Ebergen. I
am deeply grateful to them, especially for their trust in my work. The multi-dimensional
personality of Dr. Elmasry has taught me that it is possible to combine academic excellence
with social activities and observation of religious duties. Dr. Ebergen's high standards of
clarity, conciseness, and rigour will certainly benefit all my future endeavours.
I would like to thank the other members of my Ph.D. thesis examination committee,
Dr. Graham Jullien, Dr. Manoj Sachdev, Dr. Cathy Gebotys, and Dr. Farhad Mavaddat for
reading this thesis and for their comments. I would like to extend my thanks to Dr. John
Brzozowski for his fruitful remarks in our Maveric Group meetings. I am thankful to the
graduate secretary of our department, Wendy Boles, for her friendly and invaluable help. I
would also like to thank the computer system administrator of VLSI Research Group, Phil
Regier, and the group's secretary, Gehan Sabry, for their prompt assistance.
I greatly appreciate the companionship of many wonderful friends and colleagues who
have made my stay in Waterloo a rewarding experience. Although I cannot list all their
names, I will definitely remember their favours. Special thanks to Nasser Masoumi and
Majid Soleimanipour, my officemates during the last couple of years of my Ph.D. program.
Most importantly, I would like to express my sincere gratitude to my caring parents
for their countless blessings, to my brothers and sister for their unconditional love, to my
wife Sara for her encouragement and prayers, and to our son Mohammad Amin for all the
happiness he has brought to our lives.
Finally, I must admit that I am indebted to many other people in one way or another.
To all of them, I would like to say Thank You!
Maitham Shams 
Waterloo, Canada 
19 May 1999 
In the Name of God, 
the Compassionate, the Merciful
To Imam Baqer (A) 
Who split the seed of knowledge Widely! 
My Ph.D. defense date coincides with the inspiring occasion of the 1363rd birthday of Imam
Baqer (A). He is the grandson of Imam Hussain (A), the grandson of Prophet Mohammad (S).
Contents 
1 Introduction
1.1 Perspective and Motivation
1.2 Basic Terms and Definitions
1.3 MOSFET Operation
1.4 CMOS Logic Styles
1.5 Scope of Thesis Based on Abstraction Levels and Related Work
1.5.1 Device Level
1.5.2 Switch Level
1.5.3 Logic Level
1.5.4 Module Level
1.6 Thesis Overview
2 Asynchronous Circuits
2.1 Motivations for Asynchronous Circuits
2.1.1 Speed
2.1.2 Immunity to Metastable Behavior
2.1.3 Modularity
2.1.4 Low Power
2.1.5 Freedom from Clock Skew
2.2 Models and Methodologies
2.2.1 Signaling Protocols and Data Encodings
2.2.2 Delay Models
2.2.3 Formalisms
2.3 Design Techniques
2.3.1 Types of Asynchronous Circuits
2.3.2 Asynchronous Sequential Machines
2.3.3 Speed-Independent Circuits and STG Synthesis
2.3.4 Delay-Insensitive Circuits and Compilation
2.4 A Typical Asynchronous Design
2.4.1 The Control Primitives
2.4.2 Storage Primitives
2.4.3 Pipelining
2.5 Concluding Remarks
3 Quick Evaluation and Optimization of CMOS Circuits
3.1 Single Stage CMOS Inverter
3.2 Cascaded CMOS Gates
3.2.1 Total Delay
3.2.2 Rising and Falling Delays
3.2.3 Energy and Area
3.2.4 Energy-Delay Product
3.3 Some Applications and Observations
3.3.1 Inverter Chain
3.3.2 Tapered Buffer Design
3.3.3 Generation of Complementary Signals
3.4 Extracting Delay and Energy Parameters
3.5 Concluding Remarks
4 Single-Rail CMOS Implementations of the C-Element
4.1 The C-element
4.2 Dynamic C-element
4.3 Standard Implementation of the C-element
4.4 Implementation of the C-element with Weak Feedback
4.5 Symmetric Implementation of the C-element
4.6 Effect of Arriving-Time Order of Inputs
4.7 First Test Environment
4.8 Comparing the Model with the Simulation Results
4.9 Second Test Environment
4.10 Concluding Remarks
5 Differential Implementations of the C-Element
5.1 Basic DIL Implementation of the C-element
5.2 Divergence of the Complementary Outputs
5.3 DILP and DILN Implementations
5.4 Results and Discussion
5.5 Concluding Remarks
6 Delay Modeling at the Device and Switch Levels
6.1 On Short-Channel Effects
6.2 MOSFET Delay and Current
6.3 A New Model for MOSFET Saturation Current
6.4 Generalization of the Model
6.5 Effect of Input Waveform Slope on Delay
6.6 Overlapping and Opposing Currents
6.7 MOS Transistors Connected in Series
6.8 Extraction of Delay Parameters
6.9 Extraction of MOSFET Capacitances
6.10 Concluding Remarks
7 Modeling and Optimization of CMOS Logic Styles
7.1 Conventional CMOS Style
7.1.1 Conventional XOR
7.2 DCVSL CMOS Style
7.2.1 DCVSL XOR
7.3 PTL CMOS Styles
7.3.1 Critical Path Involving G-Drive
7.3.2 Critical Path Involving S-Drive
7.3.3 Critical Path Involving Both G-Drive and S-Drive
7.3.4 CPL XOR
7.4 Comparing CMOS Implementations of XOR
7.5 Concluding Remarks
8 Module Level Delay Estimation and Optimization
8.1 Conventional CMOS Logic Circuits
8.2 Mixed Logic Style CMOS Circuits
8.3 Concluding Remarks
9 Conclusion
9.1 Review
9.1.1 Delay Modeling
9.1.2 Delay Optimization
9.1.3 Circuits and Applications
9.2 Directions for Future Research
9.3 Publications That Arose from the Thesis
Bibliography 
List of Figures 
1.1 A CMOS gate (inverter) and its environment
1.2 CMOS implementations of XOR in different logic styles
1.3 Scope of our modeling technique presented in this thesis
1.4 Organization of the chapters and their relations in the thesis
2.1 Two different data communication schemes
2.2 Data transfer in two-phase signaling (a), and four-phase signaling (b)
2.3 Some delay-insensitive primitives
2.4 Two event-driven latch implementations
2.5 A CMOS implementation of a double-throw switch
2.6 A four-stage micropipeline FIFO structure
2.7 A general four-stage micropipeline structure
3.1 Layout of a single MOS transistor
3.2 Schematic of a CMOS inverter driving a capacitance C
3.3 Simulation results for a chain of five inverters to extract the value of p
3.4 A CMOS driver, gate (cell), and load represented by CMOS inverters
3.5 Variations of delay as a function of the size of the cell for a fixed driver size and various load sizes. Minimums obtained by the model using equation 3.10 are indicated on each curve.
3.6 Variations of optimal r with p for delay and energy-delay product
3.7 Variations of the delay D, energy E, and energy-delay product F for a driver, cell, and a load based on the formulation
3.8 Variations of energy-delay product as a function of the size of the cell for a fixed driver size and various load sizes. Minimums obtained by the model using equation 3.25 are indicated on each curve.
3.9 Variations of the delay (period P), energy E, and energy-delay product F in an inverter chain as a function of the PMOS to NMOS transistor size ratio r based on simulations and model
3.10 Tapered buffer circuit for driving large load
3.11 The Omega function Ω(x) (top) and the optimum tapering factor β as a function of the buffer's intrinsic to output capacitance ratio (bottom)
3.12 Circuit for producing complementary signals
3.13 The circuit used to extract the delay and energy parameters through simulation
4.1 State diagram and schematic of the C-element
4.2 Dynamic implementation of the C-element
4.3 Standard implementation of the C-element [92]
4.4 Implementation of the C-element with weak feedback inverter [55]
4.5 Implementation of the C-element with resistive inverter at the feedback
4.6 Symmetric implementation of the C-element [2]
4.7 First measurement setup: testing for optimal sizing for known fanout
4.8 SPICE simulation results, energy versus delay, for the C-element gates in the first test setup
4.9 Energy-Delay graph of the C-element implementations in the first test environment based on the model
4.10 Second measurement setup: testing for optimal sizing of a chain structure in presence of feedback
The dynamic C-element in the second test environment
Optimization of the dynamic C-element in a micropipeline control circuit
Energy-frequency graphs based on SPICE simulations for the single-rail C-element implementations under the second test
DIL implementation of the C-element
Delay and Energy of the DIL implementation for various sizes of the PMOS device in the latch
Modified DIL (MDIL) implementation of the C-element
Delay and Energy of the MDIL implementation for various sizes of the output inverter (fan-out of three inverters)
Divergence of the complementary outputs in a micropipeline control circuit: outputs of different stages at the first cycle (top), outputs of the same stage at different cycles (bottom)
Schematics of the DILP and DILN C-element implementations
HSPICE results for the DILP C-element implementation
HSPICE results for the DILN C-element implementation
HSPICE results for the single-rail and double-rail C-element implementations under the first test
Simulation results for the single-rail and double-rail CMOS implementations of the C-element in a micropipeline environment
IV-characteristic curves for a 0.5 µm NMOS transistor showing the change in current during transferring a logic 1 and a logic 0
Simulated and calculated values of ID as a function of VGS using the α-power law and the new model. The threshold voltage of the device VTN ≈ 0.66 V.
Step delay of an NMOS transistor discharging an output capacitance as VDD changes. HSPICE simulation results are compared with the results obtained by the α-power law and the results obtained by the new model represented by small circles.
Step delay of an NMOS transistor charging an output capacitance as VDD changes. HSPICE simulation results are compared with the results obtained by the new model.
Charging and discharging delays of PMOS transistor as obtained by simulations and using the model
A comparison between the results of HSPICE simulations and the delay model for the four delay types: NMOS Falling (α = 1.2), PMOS Rising (α = 1.35), NMOS Rising (α = 2), and PMOS Falling (α = 2)
Variations of the step and ramp (τ = 1 ns) delays of a CMOS transmission gate (Wn = 10 µm and Wp = 20 µm) discharging an output load (C = 1 pF)
Variations of the step and ramp (τ = 1 ns) falling delays of a CMOS structure involving opposing currents (Wn = 10 µm, Wp = 20 µm, and C = 1 pF)
An equivalent RC chain for series-connected MOS transistors
A CMOS inverter chain
A conventional CMOS cell between a driving gate and output load
Carry generation circuit in a mirror adder
Informal claim regarding optimization of conventional CMOS circuits. The delay may be a rising delay, falling delay, or average delay.
Conventional CMOS XOR implementation
Optimal transistor sizing of the conventional XOR gate using the derived formulas. The solid lines are obtained with the initial values of Wn = Wp = w = 1 µm. The dashed lines are obtained with the initial values calculated from the approximated formulas.
Optimal transistor sizing of the conventional XOR gate using the delay models and optimization package of MATLAB
Delay estimation for the conventional XOR gate using simulations and the model. The width of PMOS transistor Wp is fixed at Wp = 41 µm obtained by the model.
A DCVSL CMOS cell between a driving gate and output load
Schematic of DCVSL XOR gate
Delay estimation and optimization for the DCVSL XOR gate using simulations and the model. The ratio Wp/Wn is kept constant to find the optimum Wn.
Delay estimation and optimization for the conventional XOR gate using simulations and the model. Wn is kept constant to find the optimum Wp/Wn.
Energy estimation for the conventional XOR gate using simulations and the model
A PTL CMOS cell between driving gates and output load
A PTL CMOS cell connected to an S-drive along the critical path. A CPL OR/NOR gate is also shown as an example.
A PTL CMOS cell connected to an S-drive along the critical path controlled by a G-drive. A CPL XOR/XNOR gate is also shown as an example.
An RC network for modeling the delay in PTL circuits involving both a G-drive and an S-drive
Schematic of CPL XOR gate
Delay estimation for the conventional XOR gate using simulations and the model
Delay and energy dissipation versus VDD for the optimized CMOS implementations of the conventional (standard), DCVSL, and PTL (CPL) XOR gates. The drive's Wp/Wn = 20/10 and total CL = 200 fF.
Three stages along the critical path of a conventional CMOS logic circuit
An example of a critical path in a conventional CMOS logic circuit
8.3 A four-to-two phase converter [34]
8.4 Schematic of a conventional implementation of the TOGGLE
8.5 Transistor-level schematic of an asynchronous four-to-two phase converter
9.1 General schematics of a conventional (top), DCVSL (middle), and PTL (bottom) gate
List of Tables 
1.1 Shockley Model for an MOS transistor
3.1 Delay and Energy Parameters Extracted for a 0.8 µm BiCMOS Technology at VDD = 3 V and r = 2
4.1 Comparing SPICE simulation results with those of the analytical model for the C-element implementations
6.1 Current Model Parameters, Symbols, and Values for a 0.5 µm CMOS Technology
6.2 Delay Degradation Parameters, Symbols, and Values for a 0.5 µm CMOS Technology
6.3 Gate and Diffusion Capacitances per Unit Width: Symbols and Values for a 0.5 µm CMOS Technology
7.1 Optimal transistor sizing for the conventional XOR gate
7.2 Optimal transistor sizing for the DCVSL XOR gate
8.1 Optimal transistor sizing for the critical path of Figure 8.2
8.2 Optimal transistor sizing of CMOS logic styles. Terms enclosed within "< >" should not be included if the stage being optimized is the last stage. Notation: subscripts n (NMOS), p (PMOS), D, G, S (drives), and L (load); Accents: " '" (rising transition) and "' " (falling transition); A = fip/ ir,. The parameters are defined in Chapter 6 and Chapter 7 with reference to Figure 7.1 for conventional CMOS, Figure 7.8 for DCVSL, and Figure 7.13 for PTL.
8.3 Optimal transistor sizing for the four-to-two phase converter of Figure 8.5
9.1 Optimal transistor sizing in CMOS logic styles for minimizing the delay over one cycle. Notation: subscripts n (NMOS), p (PMOS), t (total NMOS+PMOS), D, G, S (drives), and L (load); Accents: ' (rising transition) and ' (falling transition); A = 41 Y,.
Nomenclature
Width ratio of PMOS transistor to NMOS transistor
NMOS to PMOS driveability ratio
Velocity saturation index
Tapering factor in buffer chain
Captures mobility degradation effect
Technology related parameter
Carrier mobility
MOSFET resistance times unit width
Captures velocity saturation effect
NMOS to PMOS driveability ratio
Transition time of input signal

MOSFET drain current
Unit falling delay due to transistor gate
Unit falling delay due to transistor diffusion
Unit energy dissipation due to transistor gate
Unit energy dissipation due to transistor diffusion
MOSFET effective gate length
R Resistance
S Signal slope factor
W MOSFET effective gate width
VDD Power supply voltage
VT Threshold voltage
X Delay degradation factor for load capacitance due to serially connected MOSFETs
Y Delay degradation factor for internal capacitances due to serially connected MOSFETs
d Diffusion capacitance
g Gate capacitance
m Number of transistor gates related by symmetry
n Number of transistors connected in series
q Number of transistor diffusions related by symmetry

Correspondence to cell or logic gate
Correspondence to driving logic gate or Delay
Correspondence to output load
Correspondence to transistor diffusion
Correspondence to transistor gate
Correspondence to NMOS transistor
Correspondence to PMOS transistor
Correspondence to rising delay
Correspondence to falling delay
Correspondence to total or average delay
Chapter 1 
Introduction 
Since its inception in the early 1960s, Complementary Metal Oxide Silicon (CMOS) technology
has sustained a tremendous evolution in complexity and performance. Today's CMOS
microprocessors contain millions of transistors and operate at speeds approaching one GHz.
Without the low power consumption of CMOS circuits, it would have been impossible to
integrate so many transistors on a die running at such speeds. This low power consumption
combined with large noise margins, another intrinsic feature of CMOS circuits, has made
portable computers and personal digital assistants a reality. In other words, CMOS is the
enabling technology for the modern information age. Moreover, the trend towards higher
integration densities and the increasing demands for high-speed and low-power integrated
electronics secure the position of CMOS as the mainstream VLSI technology for the years
to come.
This dissertation deals with an essential concept in the field of CMOS technology; that is,
modeling and optimization of digital CMOS circuits. This concept is almost as old as CMOS
technology itself and, yet, remains a major focus of research. The thesis, however, attempts
to present a more comprehensive and intuitive perspective of this concept. The thesis covers
the modeling and optimization of a variety of CMOS logic styles, provides insightful analysis,
and derives closed-form formulas for optimal transistor sizing of digital CMOS circuits.
Furthermore, as an application platform, the thesis deals with the less developed area of
asynchronous circuits, rather than the commonly used synchronous circuits. Although the
primary concern of optimization in this work is the performance of a design, the energy
dissipation is carefully observed to avoid wasting resources. In particular, when comparing
two designs performing a similar function, both the performance and energy are considered.
This introductory chapter continues with a section on the perspective and motivation of
this work. This is followed by some background material organized in three sections. The
first one defines basic terminologies like the delay and energy. The second one explains the
operation of MOS transistors. And the third background section briefly discusses different
CMOS logic styles. Due to the nature of this work and the chosen area of application,
it relates to previous work from various areas. A major part of this chapter is devoted to
explaining the scope of the thesis in relation to related work from the literature. This chapter
concludes with an overview of the chapters of the thesis.
1.1 Perspective and Motivation 
Although the literature is abundant with accurate models for evaluating and optimizing the
performance of digital CMOS gates, especially the inverter, they are often too complicated
for intuitive analysis and quick optimization. To clarify our point, consider Figure 1.1. The
figure shows the simplest CMOS logic gate, an inverter, and its environment. This is a basic
and the least complicated scenario in a digital circuit. The environment usually consists of
other CMOS gates, but the details are irrelevant now. Even for this case, without resorting
to circuit simulators and CAD tools, the following problem seems quite challenging.
Design the gate such that the rising delay is minimal.
Similar problems may be posed for the falling and average delays. Unfortanately, the liter- 
atnre does not have accurate, short, and expliut answers to these problems. Instead, the 
literature usually offers a solution using mathematical programming in dealing with such 
problems. 
Figure 1.1: A CMOS gate (inverter) and its environment. 
Why is such a simple case important, one may ask, especially when there are CAD tools 
that handle much more complicated situations? Well, there are a number of good reasons. 
1. The basic cases are related to our fundamental knowledge of handling digital circuits. 
The approach of mathematical programming actively pursued in the literature, 
although capable of treating circuits with a large number of gates and sometimes 
inevitable, fails to provide an intuitive understanding of optimal circuit design. 
2. If we derive a relation between the optimal sizing of a gate and its input and output 
environment, we would be able to use the results to optimize a critical path of an 
arbitrary number of gates by dividing it into smaller sections of three gates each. As 
far as the literature is concerned, this seems an innovative and quick technique. 
3. There is usually more than one CMOS topology to implement a logic function. We may 
enhance our optimization technique by choosing the right gate topology in addition to 
the right transistor sizing. To perform a fair comparison between different 
implementations of a function, however, it is necessary to first optimize each implementation 
for the given environment and then evaluate its performance. So far, this concept of 
fairness has been ignored in the literature to a large extent. 
These points summarize the perspective from which this thesis approaches the problem of 
modeling and optimization of digital CMOS circuits. 
Another concern is that the literature on modeling and optimization of digital CMOS 
circuits is confined to the conventional CMOS design style. This confinement is in spite of 
the fact that in many cases unconventional styles may be advantageous in terms of speed or 
energy dissipation. Models for evaluation and optimization of unconventional CMOS styles 
enable mixing conventional and unconventional gates for optimizing the overall performance 
of a circuit. The reader may think of unconventional CMOS styles as implementations 
not resembling the structure of the inverter in Figure 1.1, which is a conventional CMOS 
implementation. 
For the applications of this work, we have turned our attention towards the area of 
asynchronous design, because this area has become increasingly promising. In synchronous 
circuits, it is vitally important to simultaneously supply all components of a system with a 
clock edge. In view of demands for ever higher speeds and chip densities, this is becoming 
an increasingly challenging task. Since the late 1980s, the VLSI community has been exerting 
an intensifying effort on designing asynchronous circuits that, unlike synchronous circuits, 
operate without a global clock. Asynchronous circuits may also be beneficial in terms of 
performance, power dissipation, noise immunity, electromagnetic compatibility, and ease of 
interfacing. A separate chapter details the motivations for asynchronous circuit design. 
In short, this work deviates from the current literature in the following aspects. 
• Approach: Provides an intuitive understanding and explicit models leading to 
closed-form formulas. 
• Comprehensiveness: Covers both the conventional and unconventional CMOS styles, 
such as DCVSL and CPL. 
• Application: Investigates optimal implementations of asynchronous primitives and 
circuits. 
1.2 Basic Terms and Definitions 
This section introduces and defines the basic mathematical expressions for the terms 
performance, delay, energy, and power, which are frequently used throughout this thesis. 
The performance is defined as the inverse of the propagation delay, also called the average 
delay t_d, which is expressed in terms of the rising delay t_r and the falling delay t_f. 
$$t_d = \frac{t_r + t_f}{2}$$ 
The falling and rising delays are defined as the time interval between the middle of the 
input voltage swing (VDD/2) and the middle of the output voltage swing (VDD/2), when the 
output signal is falling and rising, respectively. In CMOS circuits, the power supply voltage 
is usually denoted by VDD. We often find the notion of total delay t_t more convenient than 
the average delay. The total delay signifies the delay over one switching cycle of the output. 
A switching cycle consists of a rising and a falling transition. 
The delay is always associated with charging or discharging capacitances. The general 
delay equation states that the delay t for charging an initially empty capacitance C to a 
potential V with an average current I equals 
$$t = \frac{C V}{I} \qquad (1.3)$$ 
This equation also governs the discharging delay of C from an initial potential V to ground. 
In transistor circuits, (1.3) can be directly applied by substituting V with VDD/2, if the 
input signal of a gate is a step waveform. The delay in response to such an input is called 
the step delay. In practice, however, a circuit signal exhibits an exponential behaviour and 
is best approximated by a ramp waveform. The delay in response to a ramp input is called 
the ramp delay. Deriving a general model for the ramp delay has proven complicated. Under 
certain restrictive, but realistic, assumptions the ramp delay of a gate can be expressed as a 
linear combination of two step delays, the step delay of the gate itself and the step delay of 
the preceding gate. 
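As a rough illustration of the two delay notions above, the following sketch evaluates the step delay from (1.3) with V = VDD/2 and then forms a ramp delay as a linear combination of two step delays. The function names, the component values, and in particular the two weights are hypothetical placeholders chosen for the example; they are not taken from this thesis.

```python
def step_delay(c_load, v_dd, i_avg):
    """Step delay from the general delay equation t = C*V/I with V = VDD/2."""
    return c_load * (v_dd / 2.0) / i_avg


def ramp_delay(t_step_gate, t_step_prev, w_gate=1.0, w_prev=0.5):
    """Ramp delay as a linear combination of two step delays.

    w_gate and w_prev are illustrative weights only (assumed values).
    """
    return w_gate * t_step_gate + w_prev * t_step_prev


# Assumed example values: 20 fF and 10 fF loads, 3.3 V supply, 0.5 mA average current.
t_gate = step_delay(c_load=20e-15, v_dd=3.3, i_avg=0.5e-3)
t_prev = step_delay(c_load=10e-15, v_dd=3.3, i_avg=0.5e-3)
t_ramp = ramp_delay(t_gate, t_prev)
print(f"step delay = {t_gate * 1e12:.1f} ps, ramp delay = {t_ramp * 1e12:.2f} ps")
```

With these assumed numbers the gate's step delay is 66 ps, and the preceding gate contributes half of its own 33 ps step delay to the ramp delay.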
The power dissipation is the energy per unit of time. Therefore, the average power 
dissipation P over a period T is given by 
$$P = \frac{E}{T} = \frac{Q V}{T} = I V \qquad (1.4)$$ 
where E is the total amount of energy dissipated, Q is the total charge consumed, V is 
the average voltage, and I is the average current. Accordingly, if the energy consumed by 
a certain task is reduced in proportion to the time taken to perform the task, the power 
remains constant, while obviously fewer electric charges are spent. The power 
is only a measure of expenditure per second, but doesn't give any indication of what is 
accomplished for the cost. This can be better understood in terms of the energy per task. 
Thus, we prefer to use the term energy, which implies the energy dissipation per switching 
cycle. 
Like the delay, the energy is also associated with capacitances. The energy for charging 
an initially empty capacitance C to a potential V through a power supply voltage VDD is 
expressed by 
$$E = C V V_{DD} \qquad (1.5)$$ 
Since no energy is spent for discharging C, (1.5) represents the energy per switching cycle 
of C as well. It is important to make a distinction between the energy stored in C and the 
energy consumed for charging C given by (1.5). The energy stored in C when charged to V 
can be formulated by 
$$E_C = \int_0^V C V_C \, dV_C = \frac{1}{2} C V^2$$ 
where VC represents the potential across C. Therefore, E is always larger than EC. Since the 
energy is conserved, the amount of energy E - EC must have been dissipated in the form 
of heat by the resistance connecting the power supply to C. The amount of electrical energy 
converted to heat is, interestingly, independent of the value of this resistance. In a similar 
fashion, it can be deduced that during the discharging process, the total energy stored in 
C is also converted to heat by the resistance connecting C to ground. Another point worth 
mentioning is that there are two types of energy loss in transistor circuits: dynamic and 
static. The dynamic energy is only dissipated when the nodal capacitances are switching, 
whereas the static energy is continuously consumed as long as the circuit is supplied with 
power. Nevertheless, the relative magnitude of the static energy is so small that it is always 
ignored during switching. Hence, (1.5) only applies to the dynamic energy in circuits. 
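The claim that the heat dissipated while charging C does not depend on the connecting resistance can be checked numerically. The sketch below assumes the simplest case, an ideal linear resistor R charging C all the way to V = VDD, so that the supply delivers C VDD^2, the capacitor stores half of it, and the other half should be lost in R. The function name and component values are illustrative assumptions, and a real transistor is of course not a linear resistor.

```python
import math


def charging_energies(c, v_dd, r, steps=20_000, span_in_tau=20.0):
    """Return (supply energy, stored energy, heat in R) for an RC charging event.

    The heat is obtained by midpoint-rule integration of i(t)^2 * R, with
    i(t) = (VDD / R) * exp(-t / (R * C)) for a step applied at t = 0.
    """
    tau = r * c
    dt = span_in_tau * tau / steps
    heat = 0.0
    for k in range(steps):
        t_mid = (k + 0.5) * dt
        i = (v_dd / r) * math.exp(-t_mid / tau)
        heat += i * i * r * dt
    e_supply = c * v_dd * v_dd        # energy drawn from the supply, V = VDD
    e_stored = 0.5 * c * v_dd * v_dd  # energy left on the capacitor
    return e_supply, e_stored, heat


# Same capacitor, three very different (assumed) resistances.
results = {r: charging_energies(c=50e-15, v_dd=3.3, r=r) for r in (1e2, 1e4, 1e6)}
for r, (e_supply, e_stored, heat) in results.items():
    print(f"R = {r:9.0f} ohm: heat = {heat:.4e} J, E - Ec = {e_supply - e_stored:.4e} J")
```

Changing R only changes how quickly the charging happens; the integrated heat stays at half of the supplied energy.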
1.3 MOSFET Operation 
In a MOSFET the channel current is modulated through the gate voltage. The conventional 
MOSFET model proposed by Shockley in 1952 [87], when device channels exceeded 10 µm, 
is summarized in Table 1.1. This model is still used in textbooks for hand-analysis of MOS 
circuits [72,102]. It is, however, very inaccurate in predicting the behaviour of today's 
submicron short-channel MOS devices. We use the model here to introduce a number of 
parameters. 
Table 1.1: Shockley Model for an MOS transistor. 
Operation Mode   Voltages                               Current 
Cutoff           |VGS| < |VT|                           ID = 0 
Linear           |VGS| >= |VT|, |VDS| < |VGS - VT|      ID = µ Cox (W/L) [(VGS - VT) VDS - VDS^2 / 2] 
Saturation       |VGS| >= |VT|, |VDS| >= |VGS - VT|     ID = (1/2) µ Cox (W/L) (VGS - VT)^2 
Parameters L and W are the effective device length and width respectively, µ is the 
mobility of the carriers in the channel, and Cox is the gate oxide capacitance per unit area, 
given by 
$$C_{ox} = \frac{\varepsilon}{t_{ox}}$$ 
where ε is the permittivity of the gate oxide and tox is the gate oxide thickness. Often, the 
process-dependent factors are combined into k = µ Cox. Then, the transistor gain factor β can be 
expressed in terms of technology and geometry factors as 
$$\beta = k \frac{W}{L}$$ 
The expressions given in Table 1.1 are equally valid for N- and P-type devices when VT 
is replaced with VTN and VTP, respectively; except that the negative sign of the P-type 
threshold voltage must be accounted for. According to Shockley's model, the operation of 
an NMOS transistor is as follows. If a voltage greater than some threshold, VTN, is applied 
to the gate, the substrate surface beneath the gate is inverted and an N-type channel is 
formed. No current passes through the channel if the source and drain are both grounded 
(cutoff mode). However, as the drain-to-source voltage, VDS, is raised, an almost linearly 
proportional current ID is established (linear mode). When VDS is increased such that VGD 
falls below VTN, the channel no longer reaches the drain and is pinched-off. In this case, 
channel electrons are injected into the drain and the current is controlled by the gate voltage 
alone, independent of the drain voltage (saturation mode). The voltage across the pinched-off 
channel also remains fixed at VGS - VTN regardless of the drain voltage. Shockley's model 
is particularly unreliable in reproducing the saturation mode behaviour of short-channel 
devices. For an intuitive understanding of the so called short-channel effects, the reader may 
refer to [76,95]. 
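The three operating modes described above can be written as a short function. This is a minimal sketch of the textbook square-law (Shockley) model for an NMOS device, using the gain factor β = k W/L; the function names and the numerical values of k, W, L, and VTN are assumptions made for the example.

```python
def gain_factor(k, width, length):
    """Transistor gain factor beta = k * W / L, with k = mobility * Cox."""
    return k * width / length


def shockley_id(v_gs, v_ds, v_tn, beta):
    """NMOS drain current under the square-law model (v_ds >= 0 assumed)."""
    v_ov = v_gs - v_tn                 # gate overdrive
    if v_ov <= 0.0:
        return 0.0                     # cutoff: no channel is formed
    if v_ds < v_ov:
        # linear mode: channel extends from source to drain
        return beta * (v_ov * v_ds - 0.5 * v_ds * v_ds)
    # saturation mode: channel pinched off, current set by the gate alone
    return 0.5 * beta * v_ov * v_ov


# Assumed example device: k = 100 uA/V^2, W/L = 4, VTN = 0.7 V.
beta = gain_factor(k=100e-6, width=2.0e-6, length=0.5e-6)
for v_ds in (0.0, 1.0, 2.6, 3.3):
    print(f"VDS = {v_ds:.1f} V -> ID = {shockley_id(3.3, v_ds, 0.7, beta) * 1e3:.3f} mA")
```

The linear and saturation branches give the same current at VDS = VGS - VTN, so the model is continuous at the pinch-off point.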
1.4 CMOS Logic Styles 
A PMOS transistor is a good transmitter of a logic 1 and a weak transmitter of a logic 
0. An NMOS transistor, on the other hand, is a good transmitter of a logic 0 and a weak 
transmitter of a logic 1. Therefore, PMOS transistors are normally used in the pull-up 
sections of CMOS logic gates and NMOS transistors are normally used in the pull-down 
sections of CMOS logic gates. This original style of implementing a CMOS logic gate is 
termed conventional. Some other popular CMOS logic styles, however, use transistors of 
the same type to play both pull-up and pull-down roles. These CMOS logic styles are 
generally referred to as unconventional. The conventional logic style is still the most widely 
practiced, because it is more familiar, easier to automate and, most importantly, offers a 
good balance of performance and energy dissipation. Nevertheless, using an unconventional 
logic style is sometimes beneficial in reducing the device count for implementing a function 
which, in turn, may improve the delay, energy consumption, and area. There is a variety of 
unconventional CMOS logic styles. Occasionally, a logic function may even have more than 
one conventional CMOS implementation, like that of the C-element studied in later chapters 
of this thesis. Within the conventional category, we refer to the CMOS implementation 
obtained from a Boolean function through the standard procedure outlined in text books 
like [102] as the standard CMOS implementation of a logic gate. Figure 1.2 illustrates the 
schematics of a conventional and two unconventional CMOS implementations of the XOR 
gate. The primary inputs of the gate are denoted by a and b and their complements are 
denoted by a' and b', respectively. Similarly, c and c' denote the output of the gate and its 
complement, respectively. 
Figure 1.2: CMOS implementations of XOR in different logic styles. Panels: Standard 
(conventional), DCVSL (unconventional, differential), and CPL (unconventional, differential). 
A dominant group of unconventional CMOS logic gates belong to the category of 
differential logic styles. A differential logic gate usually has a symmetrical structure and requires 
both the input signals and their complements. In return, it produces the output and its 
complement. Although differential logic circuits typically double the wiring requirements, they 
counterbalance this deficiency by lowering the device count. The most well-known 
differential CMOS logic families are Differential Cascade Voltage Switch Logic (DCVSL) [42] and 
Complementary Pass-Transistor Logic (CPL) [105]. The unconventional implementations of 
the XOR gate depicted in Figure 1.2 belong to these two logic styles. 
A DCVSL gate consists of a network of NMOS transistors and a couple of cross-coupled 
PMOS transistors forming a dynamic latch. When one of the outputs is pulled down by 
the NMOS network, the other output is pulled up through the corresponding PMOS 
transistor. Hence, unlike the case of a conventional gate, the NMOS network of a DCVSL gate 
is partially involved in the pull-up process. DCVSL was first introduced in [42]. Later, [16] 
described a design procedure for DCVSL circuits, and [17] presented a comparison between 
conventional and DCVSL full-adder circuits. DCVSL is, especially, very efficient in 
designing full-adders [73]. DCVSL gates are used in asynchronous circuits as completion signal 
detectors [61]. 
The switching activity of a CPL gate is entirely controlled by a network of NMOS 
transistors. A CPL gate usually uses a minimum size PMOS latch to avoid excessive static 
energy dissipation. CPL was first introduced in [105], where it was incorporated to design 
a fast multiplier circuit. Design issues regarding CPL circuits are discussed in [69]. Double 
Pass-transistor Logic (DPL) is another differential CMOS logic style, which has a similar 
structure to CPL. Compared to CPL, DPL doubles the number of transistors to achieve 
higher performance [94]. We study CPL, DPL, and similar gate topologies under the 
category of pass-transistor logic (PTL). By PTL we refer to the general category of CMOS 
circuits in which signals, not necessarily VDD or ground, are passed from their inputs to 
their outputs through a chain of MOS transistors. 
1.5 Scope of Thesis Based on Abstraction Levels 
and Related Work 
Digital circuits are studied at different abstraction levels. Commonly used abstraction levels 
in digital circuits are, in order of increasing abstraction, the device, switch (or circuit), logic 
(or gate), module (or functional block), and system levels. Although there is a generally 
shared understanding of these abstraction levels, there is no consensus on a precise definition 
for each level. As indicated, more than one term may refer to the same abstraction level. We 
have noted our preference by enclosing the terms we find more ambiguous within parentheses. 
For example, we prefer the term switch level over circuit level, because circuit is a very general 
term and may have different meanings depending on the context. We also like to avoid using 
the term gate level, because gate refers to one of the three terminals of an MOS transistor 
as well as a logic circuit primitive. 
The scope of the modeling and optimization technique presented in this work covers a 
few abstraction levels extending from the device level to the module level. Nevertheless, the 
study of the details is not equally distributed among the levels, and most of the effort is 
focused at the logic level. This section outlines our method related to the device, switch, 
logic, and module levels of abstraction. Our contribution at each abstraction level is also 
stated. We have followed a bottom-up approach in our modeling, such that the results 
obtained at one level are incorporated into the next higher abstraction level. Figure 1.3 
illustrates the concepts studied at each abstraction level. The reader is asked to refer to the 
corresponding part of this figure when scanning the following subsections. 
Figure 1.3: Scope of our modeling technique presented in this thesis. Panels: the four 
switching delay types (rising and falling delays), MOS transistors connected in series, input 
slope effect, overlapping currents, opposing currents, conventional, DCVSL, and PTL gates, 
mixed logic styles, and circuits with feedback. 
1.5.1 Device Level 
The basic delay equation (1.3) suggests that accurate estimation of the current and 
capacitances is essential in predicting the delay. Both of these delay components are related to the 
device level of abstraction. 
In a delay calculation, usually the saturation current of an MOS transistor is of concern. 
With the advancement of IC technology, resulting in small device geometries, the conventional 
MOSFET model of Shockley [87] is no longer valid. Shockley's model doesn't include 
short-channel effects, such as velocity saturation and mobility degradation; hence, new MOSFET 
models have been proposed. A modified version of Shockley's MOSFET model, which 
accounts for channel length modulation, is used in SPICE as the elementary Level 1 MOSFET 
model. Higher level SPICE MOSFET models, including Level 13 used in our simulations, 
are based on the semi-empirical model known as BSIM [86]. This model is too complicated 
for hand calculation and intuitive analysis. A simpler model is proposed in [95] that gives 
an intuitive understanding of a number of parameters. Sakurai and Newton have developed 
the so called α-power law for short-channel MOSFETs [76-78]. The α-power model is 
popular due to its simplicity and similarity to Shockley's model. Motivated by the considerable 
mismatch between HSPICE simulations and the α-power law, we introduce another model 
for the saturation current of submicron MOS transistors. The new model shows a high 
precision in reflecting short-channel effects. Our MOSFET saturation current model uses three 
empirical parameters, which are extracted with HSPICE simulations. 
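To make the contrast between the square law and the α-power law concrete, the sketch below evaluates the commonly quoted saturation form of the α-power law, in which the current at VGS = VDD is scaled by the normalized gate overdrive raised to the power α. It is not the new model introduced in this thesis. The function name, the drive current I_D0, and the value α = 1.3 are illustrative assumptions; α = 2 reproduces the square-law dependence.

```python
def alpha_power_idsat(v_gs, v_dd, v_t, i_d0, alpha):
    """Saturation current under the alpha-power law (commonly quoted form).

    i_d0 is the drain current at v_gs = v_dd; alpha = 2 gives the square law,
    while values closer to 1 mimic velocity-saturated short-channel devices.
    """
    if v_gs <= v_t:
        return 0.0
    return i_d0 * ((v_gs - v_t) / (v_dd - v_t)) ** alpha


# Assumed example: VDD = 3.3 V, VT = 0.7 V, 1 mA drive current at full gate voltage.
for v_gs in (1.0, 2.0, 3.3):
    long_ch = alpha_power_idsat(v_gs, 3.3, 0.7, 1e-3, alpha=2.0)
    short_ch = alpha_power_idsat(v_gs, 3.3, 0.7, 1e-3, alpha=1.3)
    print(f"VGS = {v_gs:.1f} V: alpha=2 -> {long_ch * 1e3:.3f} mA, "
          f"alpha=1.3 -> {short_ch * 1e3:.3f} mA")
```

Both curves are pinned to the same current at VGS = VDD, so the comparison shows only how the exponent reshapes the dependence on gate voltage.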
MOSFET capacitances consist of several types and can be categorized into the total 
effective gate capacitance and the total effective capacitances at the source and the drain. All 
of these capacitances are non-linear functions of the bias voltage. Therefore, the values for a 
rising transition are different from those for a falling transition. Although there are simple 
expressions for calculating these capacitances, they are not very accurate [72,102]. The more 
accurate expressions, however, are complicated and suitable only for CAD tools. We present a 
method for extracting these capacitances for rising and falling transitions with the aid of HSPICE 
simulations. This assures that the average effect of junction overlap capacitances is also 
taken into account. 
1.5.2 Switch Level 
At the switch level we study the switching delay of a number of transistor circuit scenarios 
that are frequently encountered in CMOS logic styles. Most importantly, in order to cover 
different CMOS logic styles we define and accommodate four types of switching delays: 1) 
Rising delay through a PMOS transistor, 2) Falling delay through an NMOS transistor, 3) 
Falling delay through a PMOS transistor, and 4) Rising delay through an NMOS transistor. 
Expressions for these delays are derived by applying the current model, developed at the 
device level, to each case. The distinction between these delays and the consequent modeling 
of each has not been addressed in the literature before. 
Another concern at this level is the delay through MOS transistors connected in series. 
The usual practice is to use the well known theory of Elmore [35] to approximate the delay 
in such cases. First, we present an RC model of serially connected MOS transistors. Then, 
we apply Elmore's theory in conjunction with some simulation results to come up with a new 
expression for the delay of these structures. The delay expression has two components; one 
related to the capacitances of the internal nodes and one related to the output capacitance. 
A particularly important conclusion of this part of the investigation is that the delay 
optimization of similarly sized, serially connected MOS transistors is independent of the internal 
capacitances. 
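The split of the series-stack delay into an internal-node part and an output part can be seen directly in the plain Elmore estimate for an RC ladder. The sketch below implements only that textbook estimate, not the refined expression developed in this thesis; the function name and the resistance and capacitance values are assumptions for illustration.

```python
def elmore_components(resistances, capacitances):
    """Elmore time constant of an RC ladder, split into two components.

    resistances[i] is the on-resistance of the i-th transistor counted from
    the supply rail; capacitances[i] is the capacitance of the node just past
    that transistor, the last entry being the output capacitance.
    Returns (internal-node component, output component).
    """
    if len(resistances) != len(capacitances):
        raise ValueError("one capacitance is needed per transistor node")
    r_path = 0.0
    internal = 0.0
    output = 0.0
    last = len(resistances) - 1
    for idx, (r, c) in enumerate(zip(resistances, capacitances)):
        r_path += r                      # resistance from the rail to this node
        if idx == last:
            output = r_path * c
        else:
            internal += r_path * c
    return internal, output


# Assumed example: three 1 kOhm transistors, 2 fF internal nodes, 20 fF output load.
internal, output = elmore_components([1e3, 1e3, 1e3], [2e-15, 2e-15, 20e-15])
print(f"internal = {internal * 1e12:.1f} ps, output = {output * 1e12:.1f} ps")
```

In this example the output term (60 ps) dominates the internal-node term (6 ps), which is typical when the load is much larger than the parasitic node capacitances.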
The delay of a gate is affected by the finite slope of its input signal. To simplify the delay 
modeling, the input signal is usually approximated by a ramp waveform. Hedenstierna and 
Jeppson [40] were the first to suggest that if the input slope exceeds one-third of the output 
slope, then the delay of a gate can be obtained by adding a fraction of the input rise or fall 
time to the step delay of the gate. This idea has been well received by circuit designers and 
CAD tool developers. The delay expression derived in [40] is based on the characteristics 
of long channel MOS devices which are now outdated. Sakurai and Newton [76] generalized 
that delay expression to comply with the behaviour of today's short-channel MOS devices. 
We demonstrate the extension of this concept to the four delay types introduced earlier. 
The combination of the step delay, input slope factor, and the effect of serial 
connections suffices for the delay modeling of conventional CMOS gates. In order to cover 
non-conventional CMOS gates as well, we introduce additional delay models for the cases where 
the currents of two branches in a circuit overlap at a node or oppose each other. These cases 
are schematically illustrated in Figure 1.3. A prominent example of the overlapping currents 
situation is a CMOS transmission gate. The case of opposing currents, on the other hand, 
is most evident in DCVSL gates. 
1.5.3 Logic Level 
The majority of the modeling, optimization, and application work presented in this thesis 
concerns the logic level of abstraction. We present two methods of modeling and 
optimization at this level: One which employs simplifying approximations and another which is 
considerably more accurate and rigorous. The former is restricted to conventional CMOS 
gates, and its application to modeling and comparing conventional CMOS implementations 
of an important asynchronous circuits primitive, the C-element, is detailed. The latter is 
more comprehensive, because it is systematically developed based on the previous levels of 
abstraction. This formulation is applied to optimization and comparison of conventional and 
unconventional CMOS implementations of the XOR gate, which is a widely used primitive in 
asynchronous and synchronous circuits. 
Many related publications at this level have treated the characterization of the 
conventional CMOS inverter in detail; for some recent ones the reader may refer to [7,23,31]. Often, 
it is not clear how these delay models can be generalized to cover more complex conventional 
CMOS gates. Separate papers, like [66], have dealt with the issue of generating an inverter 
equivalent for conventional CMOS gates. In addition, most of the published delay models 
express the delay in terms of a load capacitance from which the effect of changing the sizes 
of the transistors is not immediately clear to a designer. Circuit optimization work that is 
specifically related to the logic level of abstraction is mainly concerned with optimal buffer 
design for driving large loads [15,53,98]. We demonstrate that this concept is a special case 
of our general formulation. In contrast to conventional CMOS logic, the literature seems to 
lack any substantial work dealing with the delay modeling of unconventional CMOS logic 
styles. Therefore, one part of our contribution at the logic level can be summarized as 
follows: developing delay models for CMOS logic styles, including unconventional ones, that 
are explicit in terms of the width of the transistors and that lead to closed-form formulas for 
the delay optimization of CMOS gates implemented in any style. The formulas relate the 
optimal transistor sizing of a gate to the size of its succeeding, loading gate and the size of 
its preceding, driving gate. 
Another part of our work at the logic level concerns developing methodologies for 
performing a fair comparison between different implementations of a logic gate. These 
methodologies are exemplified by applying them to the implementations of the C-element and the 
XOR gate. The message is that, since the performance of a gate is characterized by its 
operating environment, one should optimize the transistor sizing of each implementation for 
the given environment before evaluating and comparing their performance. This has not 
been strictly observed in the literature, even in recent publications such as [109]. Hence, one 
may question the credibility of the results and the suggestions reported in papers comparing 
different logic styles [18,52,59,94,105]. The difficulty of performing such a fair comparison 
is that it requires exhaustive simulations, since the available optimization tools usually do 
not handle unconventional CMOS logic styles. Our formulation greatly facilitates this task. 
When examining the CMOS implementations of the C-element, we realized that all 
reported designs belonged to the conventional family. We introduce a differential CMOS logic 
style (DIL) which is similar to DCVSL, but uses an inverter latch instead of a PMOS latch. 
Thus, DIL has an inherent static memory. This property of DIL makes it suitable for 
implementing primitives like the C-element. Part of this work discusses the pros and cons of 
the DIL C-element and a number of modified versions of it. 
The delay and energy for DCVSL and DIL gates are affected by the race between the 
pull-up PMOS transistor and the pull-down NMOS network during output switching. Neglecting 
the race problem results in considerable underestimation of the delay and energy. A method 
of calculating the delay and energy of DCVSL gates that captures the race problem had not 
been reported before. A similar type of race problem also exists in conventional CMOS logic 
styles using an inverter latch at the output. This is also addressed and treated in the thesis. 
1.5.4 Module Level 
Delay modeling and optimal transistor sizing eventually targets the module and system 
levels of abstraction. Circuit optimization at this level has been actively studied for over two 
decades, from the late 1970s [75] to the present [68]. A number of techniques from the 1980s 
express the delay with posynomials and solve them using geometric programming or heuristic 
approaches [36], [90]. However, the delay models in these earlier works, including [75], [19], 
and [41], do not accommodate the effect of the finite waveform slopes. Gate sizing is formulated 
as a nonlinear programming problem in [19] and [41]. In [45], although the delay models 
include the input slope effect, the accuracy of the optimization is somewhat compromised 
by assuming a fixed size ratio of PMOS transistors to NMOS transistors within a gate. A 
convex programming formulation solved by an interior point method is used for the problem 
of gate sizing in [79]. In [25], a gate sizing algorithm is developed based on a table lookup 
nonlinear delay model. More recently, an approach for minimizing total power dissipation 
under delay constraint is presented in [68]. In general, the literature on this subject is 
confined to conventional CMOS logic styles and revolves around the method of mathematical 
programming, which seems inevitable for constrained delay optimization. 
This thesis follows a different approach in delay estimation and optimal transistor sizing 
at the module level. The technique evolves naturally from our delay models and optimization 
formulas developed at the logic level. At the logic level, we have derived a set of delay models 
and optimal transistor sizing formulas for the popular CMOS logic styles. This enables us 
to deal with modules consisting of mixed logic styles. For delay estimation along a path, 
we simply add the delay equations for the gates on that path. For delay optimization along 
a critical path, we solve a set of nonlinear equations by iteration. This method is more 
convenient, faster, and more intuitive than mathematical programming. Furthermore, it 
doesn't show any convergence problem. We investigate the optimization of modules with 
and without feedback loops. Loop structures play a significant role in asynchronous pipelines. 
We should mention that the only work which, to some extent, shares our outlook is the 
so called theory of logical effort by Sutherland and Sproull [91,93]. Their method, however, 
is limited to standard CMOS logic gates, uses a uniform PMOS to NMOS size ratio for all 
logic gates, assumes rising and falling delays are equal and, in general, does not support 
branching within a path. Moreover, they use a delay model which does not include signal 
slope effect and a short-channel view of series-connected transistors. 
A theory of delay optimization evolves from our work that states "the delay in a circuit 
consisting of conventional CMOS logic gates is minimal if for each stage along the critical 
path of the circuit, the delay due to that stage as a load equals the delay through that 
stage as a drive". This theory also generally holds for logic gates which experience no cases 
of overlapping and opposing currents. We show that the theory is valid even when the 
input slope factor and the effect of serial connection of transistors are taken into account. 
Moreover, branches and spurious capacitances along the path do not void the theory. 
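The equal-delay condition quoted above can be tried out on a toy model. The sketch below assumes a deliberately simplified stage delay, proportional to the ratio of the load size to the driver size, with no parasitics, slope effects, or series stacks; this is not the delay model of this thesis. Under that assumption, "delay as a load equals delay as a drive" reduces to each gate size being the geometric mean of its two neighbours, which is applied repeatedly to three-gate sections of the path. Function names and sizes are hypothetical.

```python
import math


def path_delay(sizes, tau=1.0):
    """Total path delay under the assumed model: stage delay = tau * load / driver."""
    return tau * sum(sizes[i + 1] / sizes[i] for i in range(len(sizes) - 1))


def size_path(sizes, sweeps=200):
    """Iteratively size the interior gates of a path.

    sizes[0] (driving gate) and sizes[-1] (load) stay fixed. Each interior gate
    is set so that its delay as a load, x[i] / x[i-1], equals its delay as a
    drive, x[i+1] / x[i], using only its two neighbours.
    """
    x = list(sizes)
    for _ in range(sweeps):
        for i in range(1, len(x) - 1):
            x[i] = math.sqrt(x[i - 1] * x[i + 1])
    return x


# Assumed example: unit-size driver, load 64 times larger, two gates to be sized.
initial = [1.0, 1.0, 1.0, 64.0]
sized = size_path(initial)
print("sizes:", [round(s, 3) for s in sized])
print("delay before:", path_delay(initial), "after:", round(path_delay(sized), 3))
```

For this toy model the iteration settles on a geometric progression of sizes, the familiar tapered-buffer result, and the path delay drops from 66 to 12 units.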
1.6 Thesis Overview 
The order at which the chapters of the tbesis appear follows the natural course of the devel- 
opment of this work, ra tha  than the abstraction levels presented in Section 1.5. Figure 1.4 
Uustrates the organization of the chapters in the thesis and indicates the relation between 
them. 
Chapter 2 is entirely devoted to an overview of asynchronous circuit design and covers
motivations, models, methodologies, and design techniques. Chapter 3 presents a formulation
for quick evaluation and optimization of conventional CMOS circuits and its applications to
an inverter chain, tapered buffer design, and generation of complementary signals. Chapter
4 applies the model derived in Chapter 3 to single-rail implementations of the C-element
and identifies the most qualified implementation for a high-performance, energy-efficient
design environment. Chapter 5 introduces the DIL logic style and compares the DIL
C-element and its modified versions with the conventional implementations of the C-element.
The chapter demonstrates cases where an asynchronous pipeline using differential C-element
gates may fail due to the divergence of complementary signals. A more comprehensive,
rigorous, and accurate delay modeling technique starts from Chapter 6, which introduces a
model for the saturation current of MOS transistors. This chapter also deals with the delay
modeling issues related to the device and switch levels of abstraction. Chapter 7 incorporates
the arguments of Chapter 6 to derive delay models and optimal transistor sizing formulas
for various conventional and differential CMOS logic styles. The same chapter includes a
comparison between the optimized CMOS implementations of the XOR gate. Chapter 8
presents an optimization technique for modules with and without feedback loops. Chapter

[Figure 1.4 blocks: Overview of Asynchronous Circuits; Delay Modeling at the Device;
Delay Modeling and Optimization at the Logic Level; Delay Estimation and Optimization
at the Module Level; Conclusion]
Figure 1.4: Organization of the chapters and their relations in the thesis.

Chapter 2
Asynchronous Circuits




Digital VLSI circuits are usually classified into synchronous and asynchronous circuits.
Synchronous circuits are generally controlled by global synchronization signals provided by a
clock. Asynchronous circuits, on the other hand, do not use such global synchronization
signals. Between these extremes there are various hybrids. Digital circuits in today's
commercial products are almost exclusively synchronous. Despite this big difference in
popularity, there are a number of reasons why asynchronous circuits are of interest.
In this chapter, we present a brief overview of asynchronous circuits.¹ First we address
some of the motivations for designing asynchronous circuits. Then, we discuss different
classes of asynchronous circuits and briefly explain some asynchronous design methodologies.
Finally, we present a typical asynchronous design in detail.
2.1 Motivations for Asynchronous Circuits 
Throughout the years researchers have had various reasons for studying and building
asynchronous circuits. Some of the often mentioned advantages of asynchronous circuits are
speed, low energy dissipation, modular design, immunity to metastable behavior, freedom
from clock skew, and low generation of and low susceptibility to electromagnetic
interference. We elaborate here on some of these potentials and indicate when they have been
demonstrated through comparative case studies.

¹ This chapter is based on our invited article on the topic of asynchronous circuits for the
Encyclopedia of Electrical and Electronics Engineering published by John Wiley [84].
2.1.1 Speed 
Speed has always been a motivation for designing asynchronous circuits. The main reasoning
behind this advantage is that synchronous circuits exhibit worst-case behavior, whereas
asynchronous circuits exhibit average-case behavior. The speed of a synchronous circuit is
governed by its clock frequency. The clock period should be large enough to accommodate
the worst-case propagation delay in the critical path of the circuit, the maximum clock skew,
and a safety factor due to fluctuations in the chip fabrication process, operating temperature,
and supply voltage. Thus, synchronous circuits exhibit worst-case performance. This
worst-case behavior is dictated by the global clock, in spite of the fact that the worst-case
propagation in many circuits, particularly arithmetic units, is improbable and may be much
longer than the average-case propagation.
Many asynchronous circuits are controlled by local communications and are based on the
principle of initiating a computation, waiting for its completion, and then initiating the next
one. When a computation is completed early, the next computation can start early. For this
reason, the speed of asynchronous circuits equipped with completion-detection mechanisms
depends on the computation time of the data being processed, not the worst-case timing.
Accordingly, such asynchronous circuits exhibit average-case performance. An example of an
asynchronous circuit where the average-case potential is nicely exploited is reported in [103],
an asynchronous divider that is twice as fast as its synchronous counterpart. Nevertheless,
to date, there are few concrete examples demonstrating that the average-case performance
of asynchronous circuits is higher than that of synchronous circuits performing similar
functions. The reason is that the average-case performance advantage is often counterbalanced
by the overhead in control circuitry and completion-detection mechanisms.
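
The gap between worst-case and average-case delay can be illustrated with a minimal sketch.
It assumes a ripple-carry adder whose completion time is proportional to the longest carry
chain; the adder model, the 32-bit width, and the function names are illustrative
assumptions and are not taken from the designs cited above.

```python
import random

def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Longest run of bit positions that a single carry travels through in a + b."""
    longest = 0
    run = 0  # length of the carry chain currently in flight
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if x & y:                # carry generated: a new chain starts here
            run = 1
        elif (x ^ y) and run:    # incoming carry propagates through this bit
            run += 1
        else:                    # carry absorbed, or no carry in flight
            run = 0
        longest = max(longest, run)
    return longest

def average_chain(width: int, trials: int, seed: int = 0) -> float:
    """Mean longest carry chain over uniformly random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        total += longest_carry_chain(a, b, width)
    return total / trials

WIDTH = 32  # assumed operand width
# Worst case: 1 + (2^WIDTH - 1) makes the carry ripple through every bit.
print("worst-case chain:", longest_carry_chain(1, (1 << WIDTH) - 1, WIDTH))
print("average chain   :", average_chain(WIDTH, 2000))
```

A clocked adder must budget for the full-width chain in every cycle, whereas a design with
completion detection waits only for the chain that actually occurs, which for random
operands is typically much shorter than the operand width.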
Besides demonstrating the average-case potential, there are case studies in which the
speed of an asynchronous design is compared to the speed of a corresponding synchronous
version. Molnar et al. report a case study of an asynchronous FIFO that is every bit as fast
as any synchronous FIFO using the same data latches [64]. Furthermore, the asynchronous
FIFO has the additional benefits that it operates under local control and is easily expandable.
At the end of this chapter we give an example of a FIFO with a slightly different control
circuit.
2.1.2 Immunity to Metastable Behavior 
Any circuit with a number of stable states also has metastable states. When such a circuit
gets into a metastable state, it can remain there for an indefinite period of time before
resolving into a stable state [12,54]. Metastable behavior occurs, for example, in circuit
primitives that realize mutual exclusion between processes, called arbiters, and components
that synchronize independent signals of a system, called synchronizers. Although the
probability that metastable behavior lasts longer than period t decreases exponentially with t,
it is possible that metastable behavior in a synchronous circuit lasts longer than one clock
period. Consequently, when metastable behavior occurs in a synchronous circuit, erroneous
data may be sampled at the time of the clock pulses. An asynchronous
circuit deals gracefully with metastable behavior by simply delaying the computation until
the metastable behavior has disappeared and the element has resolved into a stable state.
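
The exponential decay mentioned above can be sketched numerically. The model below
assumes that the resolution time is exponentially distributed with a time constant tau;
the parameter names and the normalized values (tau = 1, clock period = 3) are hypothetical
and serve only to contrast sampling at a fixed clock edge with waiting for resolution.

```python
import math
import random

def prob_unresolved(t: float, tau: float) -> float:
    """Probability that metastability lasts longer than t (assumed exponential model)."""
    return math.exp(-t / tau)

def compare(trials: int, clock_period: float, tau: float, seed: int = 0):
    """Return (number of synchronous sampling errors, mean asynchronous wait)."""
    rng = random.Random(seed)
    sync_errors = 0
    total_wait = 0.0
    for _ in range(trials):
        resolve_time = rng.expovariate(1.0 / tau)  # mean resolution time equals tau
        if resolve_time > clock_period:
            sync_errors += 1        # the clock edge arrives before the element resolves
        total_wait += resolve_time  # the asynchronous circuit simply waits this long
    return sync_errors, total_wait / trials

errors, mean_wait = compare(trials=10000, clock_period=3.0, tau=1.0)
print("expected error fraction:", prob_unresolved(3.0, 1.0))
print("observed error fraction:", errors / 10000)
print("mean asynchronous wait :", mean_wait)
```

In this sketch the synchronous sampler fails in a small but nonzero fraction of the trials,
while the asynchronous circuit never samples bad data and pays only the resolution time.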
2.1.3 Modularity 
Modularity in design is an advantage exploited by many asynchronous design styles. The
basic idea is that an asynchronous system is composed of functional modules communicating
along well-defined interfaces. Composing asynchronous systems is simply a matter
of connecting the proper modules with matching interfacial specifications. The interfacial
specifications describe only the sequences of events that can take place and do not specify
any restrictions on the timing of these events. This characteristic reduces the design time
and complexity of an asynchronous circuit, because the designer does not have to worry
about the delays incurred in individual modules or the delays inserted by connection wires.
Designers of synchronous circuits, on the other hand, often pay considerable attention to
satisfying the detailed interfacial timing specifications.
Besides ease of composability, modular design also has the potential for better technology
migration, ease of incremental improvement, and reuse of modules [91]. Here the idea is
that an asynchronous system adapts itself more easily to the advances in technology. The
obsolete parts of an asynchronous system can be replaced with new parts to improve system
performance. Synchronous systems cannot take advantage of new parts as easily, because
they must be operated with the old clock frequency or other modules must be redesigned to
operate at the new clock frequency.
One of the earliest projects that exploited modularity in designing asynchronous circuits
is the Macromodules project [21]. Another nice example where modular design has been 
demonstrated is the TANGRAM compiler developed at Philips Research Laboratories [6]. 
2.1.4 Low Power 
Due to rapid growth in the use of portable equipment and the trend in high-performance
processors towards unmanageable power dissipation, energy efficiency has become crucial
in VLSI design. Asynchronous circuits are attractive for energy-efficient designs, mainly
because of the elimination of the clock. In systems with a global clock, all of the latches and
registers operate and consume dynamic energy during each clock pulse, in spite of the fact
that many of those latches and registers might not have new data to store. There is no such
waste of energy in asynchronous circuits, because computations are initiated only when they
need to be done.
Two notable examples that demonstrated the potential of asynchronous circuits in
energy-efficient design are the work done at Philips Research Laboratories and at Manchester
University. The Philips group designed a fully asynchronous digital compact-cassette (DCC)
error detector which consumed 80% less energy than a similar synchronous version [4]. The
AMULET group at Manchester University successfully implemented an asynchronous version
of the ARM microprocessor, one of the most energy-efficient synchronous microprocessors.
The asynchronous version achieved a power dissipation comparable to the fourth generation
of ARM, around 150 mW [37], in a similar technology.
Recently, power management techniques are being used in synchronous systems to turn
the clock on and off conditionally. However, these techniques are only worthwhile
implementing at the level of functional units or higher. Besides, the components that monitor the
environment for switching the clock continue dissipating energy.
It is also worth mentioning that unlike synchronous circuits, most asynchronous circuits
do not waste energy on hazards, which are spurious changes in a signal. Asynchronous
circuits are essentially designed to be hazard-free. Hazards can be responsible for up to 40%
of energy loss in synchronous circuits [11].
2.1.5 Freedom from Clock Skew
Because asynchronous circuits generally do not have clocks, they do not have many of the
problems associated with clocks. One such problem is clock skew, the technical term for
the maximum difference in clock arrival time at different parts of a circuit. In synchronous
circuits, it is crucial that all modules operating with a common clock receive this signal
simultaneously, that is, within a tolerable period of time. Minimizing clock skew is a difficult
problem for large circuits. Various techniques have been proposed to control clock skew, but
generally they are expensive in terms of silicon area and energy dissipation. For instance, the
clock distribution network of the DEC Alpha, a 200 MHz microprocessor at a 3.3 V supply,
occupies 10% of the chip area and has a 40% share in the total chip power consumption [30].
Although asynchronous circuits do not have the clock skew problem, they have their own
set of problems in minimizing the overhead needed for synchronization among the parts.
2.2 Models and Methodologies
There are many models and methodologies for analyzing and designing asynchronous circuits.
Asynchronous circuits can be categorized by the following criteria: signaling protocol and
data encoding, underlying delay model, mode of operation, and formalism for specifying and
designing circuits. This section presents an informal explanation of these criteria.
2.2.1 Signaling Protocols and Data Encodings 
Modules in an asynchronous circuit communicate data with some signaling protocol
consisting of request and acknowledgment signals. There are two common signaling protocols
for communicating data between a sender and a receiver: the four-phase and the two-phase
protocol. In addition to the signaling protocol, there are different ways to encode data. The
most common encodings are single-rail and dual-rail encoding. We explain the two signaling
protocols first and then discuss the data encodings.
If the sender and receiver communicate through a two-phase signaling protocol, then each
communication cycle has two distinct phases. The first phase consists of a request initiated
by the sender. The second phase consists of an acknowledgment by the receiver. The request
and acknowledgment signals are often implemented by voltage transitions on separate wires.
No distinction is made between the directions of voltage transitions. Both rising and falling
transitions denote a signaling event.
The four-phase signaling protocol consists of four phases: a request followed by an
acknowledgment, followed by a second request, and finally a second acknowledgment. If the
request and acknowledgment are implemented by voltage transitions, then at the end of
every four phases, the signaling wires return to the same voltage levels as at the start of the
four phases. Because the initial voltage is usually zero, this type of signaling is also called
return-to-zero signaling. Other names for two-phase and four-phase signaling are two-cycle
and four-cycle signaling, respectively, or transition and level signaling, respectively.
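
The two protocols can be contrasted with a minimal sketch that records the events on the
request and acknowledgment wires for a sequence of transfers. The function names and the
trace format are illustrative assumptions; wire delays and the data path are ignored.

```python
def two_phase(items):
    """Transition signaling: every change on req or ack counts as an event."""
    req = ack = 0
    received, trace = [], []
    for item in items:
        req ^= 1                    # phase 1: request (rising or falling edge)
        trace.append(("req", req))
        received.append(item)       # receiver accepts the item
        ack ^= 1                    # phase 2: acknowledgment
        trace.append(("ack", ack))
    return received, trace

def four_phase(items):
    """Return-to-zero signaling: both wires are low again after each transfer."""
    req = ack = 0
    received, trace = [], []
    for item in items:
        req = 1                     # phase 1: request
        trace.append(("req", req))
        received.append(item)       # receiver accepts the item
        ack = 1                     # phase 2: acknowledgment
        trace.append(("ack", ack))
        req = 0                     # phase 3: second request event, req returns to zero
        trace.append(("req", req))
        ack = 0                     # phase 4: second acknowledgment, ack returns to zero
        trace.append(("ack", ack))
    return received, trace

_, t2 = two_phase(["x", "y"])
_, t4 = four_phase(["x", "y"])
print(len(t2), "events for two transfers with two-phase signaling")
print(len(t4), "events for two transfers with four-phase signaling")
```

The four-phase trace contains twice as many events per transfer, which is the extra time
and energy cost of returning the wires to their initial levels.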
Both signaling protocols can be used with single-rail and dual-rail data encodings. In
single-rail data encoding each bit is encoded with one wire, whereas in dual-rail encoding, each bit
is encoded with two wires.
In single-rail encoding, the value of the bit is represented by the voltage on the data
wire. When communicating n data bits with a single-rail encoding, during periods where
the data wires are guaranteed to remain stable, we say that the data are valid. During periods
where the data wires are possibly changing, we say the data are invalid. A two-phase or
four-phase signaling protocol is used to tell the receiver when data are valid or invalid. The
sender informs the receiver about the validity of the data through the request signal, and the
receiver, in turn, informs the sender of the receipt of the data through the acknowledgment
signal. Therefore, to communicate n bits of data, a total number of (n + 2) wires are necessary
between the sender and the receiver. The connection pattern for single-rail encoding and
two- or four-phase signaling is depicted in Figure 2.1(a).
Figure 2.1: (a) Bundled Data Convention, (b) Dual-Rail Data Encoding
Figure 2.2(a) shows the sequence of events in a two-phase signaling protocol. The events
include the times when the data become valid and invalid. The transparent bars indicate
the period when data are valid; during the other periods, data are invalid. Notice that a
request signal occurs only after data become valid. This is an important timing restriction
associated with these communication protocols, namely, the request signal that indicates
that data are valid should always arrive at the receiver after all data wires have attained
their proper value. The restriction is referred to as the bundling constraint. For this reason
the communication protocol is often called the bundled data protocol. Figure 2.2(b) shows a
sequence of events in a four-phase protocol and single-rail data encoding. Other sequences 






Figure 2.2: Data transfer in two-phase signaling (a), and four-phase signaling (b)

The dual-rail encoding scheme uses two wires for every data bit. There are several
dual-rail encoding schemes. All combine the data encoding and signaling protocol. There is
no explicit request signal, and the dual-rail encoding schemes all require (2n + 1) wires as
illustrated in Figure 2.1(b). In the case of four-phase signaling, there are several encodings
that can be used to transmit a data bit. The most common encoding has the following
meaning for the four states that each pair of wires can be in: 00 = reset, 10 = valid 0,
01 = valid 1, and 11 is an unused state. Every pair of wires has to go through the reset state
before becoming valid again. In the first phase of the four-phase signaling protocol, every
pair of wires leaves the reset state for a valid 0 or 1 state. The receiver detects the arrival of a
new set of valid data when all pairs of wires have left the reset state. This detection replaces
an explicit request signal. The second phase consists of an acknowledgment to inform the
sender that data has been consumed. The third phase consists of the reset of all pairs of
wires to the reset state, and the fourth phase is the reset of the acknowledgment.
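
The four-phase dual-rail code described above can be written down directly. The following
is a minimal sketch; the helper names are hypothetical, each pair is represented by its
two-character code exactly as listed in the text, and completion detection is modeled as a
simple check that every pair has left the reset state.

```python
RESET = "00"
ENCODE = {0: "10", 1: "01"}          # 10 = valid 0, 01 = valid 1
DECODE = {code: bit for bit, code in ENCODE.items()}

def encode_word(bits):
    """Map each data bit to its pair of wires."""
    return [ENCODE[b] for b in bits]

def data_valid(pairs):
    """Completion detection: data are valid once every pair has left reset."""
    if "11" in pairs:
        raise ValueError("11 is an unused state")
    return all(p != RESET for p in pairs)

def decode_word(pairs):
    """Recover the data bits; only legal when all pairs are valid."""
    if not data_valid(pairs):
        raise ValueError("some pairs are still in the reset state")
    return [DECODE[p] for p in pairs]

def wire_count(n_bits):
    """Two data wires per bit plus one acknowledgment wire."""
    return 2 * n_bits + 1

pairs = encode_word([1, 0, 1])
print(pairs, data_valid(pairs), decode_word(pairs))
print("wires for 8 bits:", wire_count(8))
```

Because validity is derived from the data wires themselves, no separate request wire and
no bundling constraint are needed in this scheme.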
In a two-phase signaling protocol, a different dual-rail encoding is used. An example of
an encoding is as follows. Each pair of wires has one wire associated with a 0 and one wire
associated with a 1. A transition on the wire associated with 0 represents the communication
of a 0, whereas a transition on the other wire represents a communication of a 1. Thus, a
transition on one wire of each pair signals the arrival of a new bit value. A transition on
both wires is not allowed. In the first phase of the two-phase signaling protocol, every pair of
wires communicates a 0 or a 1. The second phase is an acknowledgment sent by the receiver.
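
The transition-based encoding for one pair of wires can be sketched as follows. The
function names are illustrative assumptions, both wires are assumed to start at level 0,
and the acknowledgment phase is omitted.

```python
def send_bits(bits):
    """Return the wire levels of one pair after each communicated bit."""
    wires = [0, 0]       # wires[0] is the wire for a 0, wires[1] the wire for a 1
    states = []
    for b in bits:
        wires[b] ^= 1    # a transition on exactly one wire carries the bit
        states.append(tuple(wires))
    return states

def receive_bits(states):
    """Recover the bits by observing which wire changed at each step."""
    prev = (0, 0)
    bits = []
    for cur in states:
        changed = [i for i in (0, 1) if cur[i] != prev[i]]
        if len(changed) != 1:
            raise ValueError("exactly one wire of the pair must change")
        bits.append(changed[0])
        prev = cur
    return bits

states = send_bits([1, 1, 0, 1])
print(states)
print(receive_bits(states))
```

Note that the absolute wire levels carry no meaning here; only the identity of the wire
that changed does.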
Of all data encodings and signaling protocols, the most popular are the single-rail
encoding and four-phase signaling protocol. The main advantages of these protocols are the small
number of connection wires and the simplicity of the encoding, which allows using
conventional techniques for implementing data operations. The disadvantages of these protocols
are the bundling constraints that must be satisfied and the extra energy and time wasted
in the additional two phases compared with two-phase signaling. Dual-rail data encodings
have been used to communicate data in asynchronous circuits free of any timing constraints.
Dual-rail encodings, however, are expensive in practice, because of the many interconnection
wires, the extra circuitry to detect completion of a transfer, and the difficulty in data
processing.
2.2.2 Delay Models 
An important characteristic distinguishing different asynchronous circuit styles is the delay
model on which they are based. For each circuit primitive, gate or wire, a delay model
stipulates the sort of delay it imposes and the range of the delays. Delay models are needed
to analyze all possible behavior of a circuit for various correctness conditions, like the absence
of hazards.
A circuit is composed of gates and interconnection wires, all of which impose delays on
the signals propagating through them. The delay models are categorized into two classes:
pure delay models and inertial delay models. In a pure delay model, the delay associated with
a circuit component produces only a time shift in the voltage transitions. In reality, a circuit
component may shift the signals and also filter out pulses of small width. A delay model
which captures this fact is called an inertial delay model. Both classes of delay models can
have several ranges for the delay shifts. We distinguish the zero-delay, fixed-delay,
bounded-delay, and unbounded-delay models. In the zero-delay model, the values of the delays are
zero. In the fixed-delay model, the values of the delays are constant, whereas in the
bounded-delay model the values of the delays vary within a bounded range. The unbounded-delay
model does not impose any restriction on the value of the delays except that they cannot
be infinite. Sometimes two different delay models are assumed for the wires and the gates
in an asynchronous circuit. For example, the operation of a class of asynchronous circuits is
based on the zero-delay model for wires and the unbounded-delay model for gates. Formal
definitions of the various delay models are given in [10].
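
The difference between the two classes can be sketched on a list of pulses. This is a
simplified illustration rather than the formal definitions of [10]: a signal is assumed to
be a list of (start, end) intervals during which it is high, the delay is fixed, and the
inertial model is reduced to dropping pulses narrower than a hypothetical min_width.

```python
def pure_delay(pulses, delay):
    """Pure delay: every pulse is shifted in time and nothing is filtered."""
    return [(start + delay, end + delay) for start, end in pulses]

def inertial_delay(pulses, delay, min_width):
    """Inertial delay: pulses are shifted, and pulses narrower than min_width vanish."""
    return [
        (start + delay, end + delay)
        for start, end in pulses
        if end - start >= min_width
    ]

pulses = [(0, 5), (7, 8), (10, 20)]   # the middle pulse is a narrow glitch
print(pure_delay(pulses, 2))
print(inertial_delay(pulses, 2, 3))
```

Under the pure model the glitch reaches the output, which is why hazard analysis depends
on the delay model that is assumed.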
A concept closely related to the delay model of a circuit is its mode of operation. The
mode of operation characterizes the interaction between a circuit and its environment.
Classical asynchronous circuits operate in the fundamental mode [58,97], which assumes that the
environment changes only one input signal and waits until the circuit reaches a stable state.
Then the environment is allowed to apply the next change to one of the input signals. Many
modern asynchronous circuits operate in the input-output mode. In contrast to the
fundamental mode, the input-output mode allows for input changes immediately after receiving
an appropriate response to a previous input change, even if the entire circuit has not yet
stabilized. The fundamental mode was introduced in the sixties to simplify the analysis and
design of gate circuits with Boolean algebra. The input-output mode evolved in the eighties
from event-based formalisms to describe modular design methods that abstracted from the
internal operation of a circuit.
2.2.3 Formalisms
Just as in any other design discipline, designers of asynchronous circuits use various
formalisms to master the complexities in the design and analysis of their artifacts. The
formalisms used in asynchronous circuit design can be categorized into two classes: formalisms
based on Boolean algebra and formalisms based on sequences of events. Most design
methodologies in asynchronous circuits use some mixture of both formalisms.
The design of many asynchronous circuits is based on Boolean algebra or its derivative
switching theory. Such circuits often use the fundamental mode of operation, the
bounded-delay model, and have, as primitive elements, gates that correspond to the basic logic
functions, like AND, OR, and inversion. These formalisms are convenient for implementing logic
functions, analyzing circuits for the presence of hazards, and synthesizing fundamental-mode
circuits [10, 97].
Event-based formalisms deal with sequences of events rather than binary logic variables.
Circuits designed with an event-based formalism operate in the input-output mode,
under an unbounded-delay model, and have, as primitive elements, the JOIN, the TOGGLE,
and the MERGE, for example. Event-based formalisms are particularly convenient for
designing asynchronous circuits when a high degree of concurrency is involved. Several tools
have been generated for the automatic verification of asynchronous circuits with event-based
formalisms [29,32]. Examples of event-based formalisms are Trace Theory [3,34,99], DI
Algebra [50], Petri nets, and Signal Transition Graphs [18,60].
2.3 Design Techniques 
This section introduces the most popular types of asynchronous circuits and briefly describes
some of their design techniques.
2.3.1 Types of Asynchronous Circuits 
There are special types of asynchronous circuits for which formal and informal specifications
have been given. Here are brief informal descriptions of some of them in a historical context.
There are two types of logic circuits: combinational and sequential. The output of a
combinational circuit depends only on the current inputs, whereas the output of a sequential
circuit depends also on the previous sequences of the inputs. With this definition of a
sequential circuit, almost all asynchronous circuit styles fall into this category. However,
the term asynchronous sequential circuits or machines generally refers to those asynchronous
circuits based on finite state machines similar to those in synchronous sequential circuits
[48,97].
Muller was the first to give a rigorous formalization of a special type of circuits for which
he coined the name speed-independent circuits. An account of this formalization is given
in [63,65]. Informally, a speed-independent circuit is a network of gates that satisfies its
specification irrespective of any gate delays.
From a design discipline that was developed as part of the Macromodules project [21]
at Washington University in St. Louis, the concept of another type of asynchronous circuits
evolved, which was given the name delay-insensitive circuit, that is, a network of modules
that satisfies its specification irrespective of any element and wire delays. It was realized
that proper formalization of this concept was needed to specify and design such circuits in
a well-defined manner. Such a formalization was given by Udding [96].
Another name frequently used in designing asynchronous circuits is self-timed systems.
This name was introduced by Seitz [80]. A self-timed system is described recursively as
either a self-timed element or a legal connection of self-timed systems. The idea is that
self-timed elements can be implemented with their own timing discipline, and some may
even have synchronous implementations. In composing self-timed systems from self-timed
elements, however, no reference to the timing of events is made; only the sequence of events
is relevant. In other words, the elements "keep time to themselves."
Some have found the unbounded gate-and-wire delay assumption, on which the concept
of a delay-insensitive circuit is based, to be too restrictive in practice. For example, the
unbounded gate-and-wire delay assumption implies that a signal sent to multiple recipients
by a fork can incur a different unbounded delay for each of the recipients. They proposed
to relax this delay assumption slightly by using isochronic forks [56]. An isochronic fork is a
fork whose difference in the delays of its branches is negligible compared with the delays in
the element to which it is connected. A delay-insensitive circuit that uses isochronic forks
is called a quasi-delay-insensitive circuit [3,56]. Although the use of isochronic forks gives
more design freedom in exchange for less delay insensitivity, care has to be taken with its
implementation [2].
2.3.2 Asynchronous Sequential Machines 
The design of asynchronous sequential finite state machines was initiated with the
pioneering work of Huffman [48]. He proposed a structure similar to that of synchronous sequential
circuits consisting of a combinational logic circuit, inputs, outputs, and state variables [97].
Huffman circuits, however, store the state variables in feedback loops containing delay
elements, instead of in latches or flip-flops, as synchronous sequential circuits do. The design
procedure begins with creating a flow table and reducing it through some state minimization
technique. After a state assignment, the procedure obtains the Boolean expressions
and implements them in combinational logic with the aid of a logic minimization program.
To guarantee a hazard-free operation, Huffman circuits adopt the restrictive
single-input-change fundamental mode, that is, the environment changes only one input and waits until
the circuit becomes stable before changing another input. This requirement can substantially
degrade the circuit performance. Hollaar realized this fact and introduced a new structure
in which the fundamental mode assumption is relaxed [44]. In his implementation, the state
variables are stored in NAND latches, so that inputs are allowed to change earlier than the
fundamental mode would allow. Although Hollaar's method improves the performance, it
suffers from the danger of producing hazards. Besides, neither technique seems to be adequate
for designing concurrent systems. Models and algorithms for the analysis of asynchronous
sequential circuits have been developed by Brzozowski and Seger [10].
The quest for more concurrency, higher performance, and hazard-free operation resulted
in the formulation of a new generation of asynchronous sequential circuits known as
burst-mode machines [22,26]. A burst-mode circuit does not react until the environment performs
a number of input changes called an input burst. The environment, in turn, is not allowed to
introduce the next input burst until the circuit produces a number of outputs called an output
burst. A state graph is used to specify the transitions caused by the input and output bursts.
Two synthesis methods have been proposed and automated for implementing burst-mode
circuits. The first method employs a locally generated clock to avoid some hazards [67]. The
second method uses three-dimensional flow tables and is based on Huffman circuits [109].
One limitation of burst-mode circuits is that they restrict concurrency within a burst.
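
The burst-mode discipline can be sketched as a small state graph interpreter. This is a
simplified illustration and not one of the cited synthesis methods: each state is assumed
to have a single outgoing arc, and the signal names and the two-state specification are
hypothetical.

```python
# state -> (input burst, output burst, next state); a hypothetical specification
SPEC = {
    "idle": ({"req", "sel"}, {"grant"}, "busy"),
    "busy": ({"done"}, {"ack", "release"}, "idle"),
}

def run(spec, start, input_events):
    """Consume input events; emit an output burst once a full input burst has arrived."""
    state = start
    pending = set()       # input changes seen so far in the current burst
    output_bursts = []
    for event in input_events:
        burst_in, burst_out, next_state = spec[state]
        if event not in burst_in or event in pending:
            raise ValueError(f"unexpected input {event!r} in state {state!r}")
        pending.add(event)
        if pending == burst_in:              # the input burst is complete
            output_bursts.append(sorted(burst_out))
            state = next_state
            pending = set()
    return state, output_bursts

print(run(SPEC, "idle", ["sel", "req", "done"]))
```

The inputs of a burst may arrive in any order, but the circuit produces no output until
the whole burst has been seen.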
2.3.3 Speed-Independent Circuits and STG synthesis
Speed-independent circuits are usually designed by a form of Petri nets [71]. A popular
version of Petri nets, signal transition graphs (STG), was introduced by Chu. He also developed
a synthesis technique for transforming STGs into speed-independent circuits [18]. Chu's work
was extended by Meng, who produced an STG-based tool for synthesizing speed-independent
circuits from high-level specifications [61]. In this technique, a circuit is composed of
computational blocks and interconnection blocks. Computational blocks range from a simple
shifter module to more complicated ones, such as ALUs, RAMs, and ROMs. Interconnection
blocks synchronize the operation of computational blocks by producing appropriate
control signals. Computational blocks generate completion signals after their output data
become valid. The interconnection blocks use the completion signals to generate four-phase
handshake protocols.
2.3.4 Delay-Insensitive Circuits and Compilation
Several researchers have proposed techniques for designing delay-insensitive circuits.
Ebergen [33] has developed a synthesis method based on the formalism of Trace Theory. The
method consists of specifying a component by a program and then transforming this program
into a delay-insensitive network of basic elements. The program notation allows specifying
parallel behavior. Ebergen's method has been applied to the design of small components
like stacks, various counters, and arbiters [34].
Martin proposes a method [56] that starts with a specification of an asynchronous circuit
in a high-level programming language similar to Hoare's Communicating Sequential Processes
(CSP) [43]. An asynchronous circuit is specified as a group of processes communicating over
channels. After various transformations, the program is mapped into a network of gates.
This method led to the design of an asynchronous microprocessor [57] in 1989. Martin's
method yields quasi-delay-insensitive circuits.
Van Berkel [3] has designed a compiler based on a high-level language called Tangram. A
Tangram program also specifies a set of processes communicating over channels. A Tangram
program is first translated into a handshake circuit. Then these handshake circuits are
mapped into various target architectures, depending on the data-encoding techniques or
standard-cell libraries used. The translation is syntax-directed, which means that every
operation occurring in a Tangram program corresponds to a primitive in the translated
handshake circuit. This property is exploited by various tools that quickly estimate the
area, performance, and energy dissipation of the final design by analyzing the Tangram
program. Van Berkel's method also yields quasi-delay-insensitive circuits.
Other translation methods from a CSP-like language to a (quasi-) delay-insensitive circuit
can be found in [9,101].
2.4 A Typical Asynchronous Design 
In this section we present a typical asynchronous design, a micropipeline [92]. The circuit
uses single-rail encoding with the two-phase signaling protocol to communicate data between
stages of the pipeline. The control circuit for the pipeline is a delay-insensitive circuit. First
we present the primitives for the control circuit, then we present the latches that store the
data, and finally we present the complete design.
2.4.1 The Control Primitives 
Figure 2.3 shows a few simple primitives used in event-based design styles. The schematic 
symbol for each primitive is depicted opposite its name. 




Figure 2.3: Some delay-insensitive primitives 
The simplest primitive is the WIRE, a two-terminal element that produces an output
event on its output terminal b after every input event on its input terminal a. Input and
output events in a WIRE must alternate. An input event a must be followed by an output
event b before another event a occurs. A WIRE is physically realizable with a wire, and events
are implemented by voltage transitions. An initialized WIRE, or IWIRE, is very similar to a
WIRE, except that it starts by producing an output event b instead of accepting an input
event a; after this, its behavior exactly resembles that of a WIRE.
The primitive for synchronization is the JOIN, also called the RENDEZVOUS (211. A JotN 
has two inputs a and b and one output c. The JOIN perfolplg the AND operation of two events 
a and b. It produces an output event c only after both of its inputs, a and b, received an 
event. The inputs can change again aRer an output is produced. A JOIN cari be implemented 
by a Muller C-element, explained in the next section. 
The MERGE component performs the OR operation of two events. If a MERGE component
receives an event on either of its inputs, a or b, it produces an output event c. After an input
event, there must be an output event; successive input events are not allowed. A MERGE
can be implemented by an XOR gate.
The TOGGLE has a single input a and two outputs b and c. After an event on input a, an
event occurs on output b. The next event on a results in a transition on output c. An input 
event must be followed by an output event before another input event can occur. Thus, 
output events alternate or toggle after each input event. The dot in the TOGGLE schematic 
indicates the output which produces the first event. 
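The event behaviour of these primitives can be mimicked with a short Python sketch. It is only an illustration under assumptions that are not part of the text: events are modelled as level changes on wires that all start low, the names merge, Join, and Toggle are hypothetical, and the JOIN is modelled as a state-holding element that copies its inputs when they agree.

```python
# Two-phase (transition) signalling: an "event" is any change of a wire's level.

def merge(a, b):
    """MERGE as an XOR gate: the output level changes whenever one input changes."""
    return a ^ b


class Join:
    """JOIN: the output changes only after both inputs have changed."""

    def __init__(self):
        self.c = 0              # all wires are assumed to start low

    def update(self, a, b):
        if a == b:              # both inputs have received an event
            self.c = a
        return self.c           # otherwise the previous level is held


class Toggle:
    """TOGGLE: input events are steered alternately to outputs b and c."""

    def __init__(self):
        self.a = 0
        self.b = 0
        self.c = 0
        self.next_is_b = True   # the dotted output b produces the first event

    def update(self, a):
        if a != self.a:         # an event arrived on input a
            self.a = a
            if self.next_is_b:
                self.b ^= 1
            else:
                self.c ^= 1
            self.next_is_b = not self.next_is_b
        return self.b, self.c


join = Join()
join_trace = [join.update(a, b) for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]]

toggle = Toggle()
toggle_trace = [toggle.update(a) for a in [1, 0, 1, 0]]
```

In this trace the JOIN output changes only on the second and fourth input pairs, when both inputs have received an event, while the TOGGLE steers successive events on a alternately to b and c.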
2.4.2 Storage Primitives 
Now we discuss two event-controlled latches due to Sutherland [92], as depicted in Figure 2.4.
Their operation is managed through two input control signals: capture and pass, labeled c
and p respectively. They also have two output control signals: capture done, cd, and pass
done, pd. The input data is labeled D, and the output data is labeled Q.
Figure 2.4: Two event-driven latch implementations
Implementation (a) is composed of three so-called double-throw switches. Implementation
(b) includes a MERGE, a TOGGLE, and a level-controlled latch consisting of a double-throw
switch and an inverter. A double-throw switch is schematically represented by an inverter
and a switching tail. The tail toggles between two positions based on the logic value of a
controlling signal. A double-throw switch, in fact, is a two-input multiplexer that produces
an inverted version of its selected input. A CMOS implementation of the double-throw
switch is shown in Figure 2.5 [92]. The position of the switch corresponds to the state
where c is low.
Figure 2.5: A CMOS implementation of a double-throw switch
An event-controlled latch can assume two states: transparent and opaque. In the trans- 
parent state no data is latched, but the output replicates the input, because a path of two 
inverting stages exists between the input and the output. In the opaque state, this path is 
disconnected so that the input data can change without affecting the output; the current
data at the output, however, is latched. Implementations in Figures 2.4(a) and 2.4(b) are
both shown in their initial transparent states. The capture and pass signals in an event-
controlled latch always alternate. Upon a transition on c, the latch captures the current
input data and becomes opaque. The following transition on cd is an acknowledgment to
the data provider that the current data has been captured and that the input data can be
changed safely. A subsequent transition on p returns the latch back to its transparent state
to pass the next data to its output. The p signal is acknowledged by a transition on pd.
Notice that in implementation (a) of Figure 2.4, signals cd and pd are merely delayed and 
possibly amplified versions of c and p, respectively. 
A group of event-controlled latches, similar to implementation (a) of Figure 2.4, can be
connected, sharing a capture wire and a pass wire, to form an event-controlled register of
arbitrary data width. Implementation (b) of Figure 2.4 can be generalized similarly into a
register by inserting additional level-controlled latches between the MERGE and the TOGGLE.
A comparison of different micropipeline latches is reported in [28] and later in [IO?].
2.4.3 Pipelining 
Pipelining is a powerful technique for constructing high-performance processors. Micropipelines
are elegant asynchronous circuits that have gained much attention in the asynchronous
community. Many VLSI circuits based on micropipelines have been successfully fabricated. The
AMULET microprocessor [37] is one example.
The simplest form of a micropipeline is a FIFO. A four-stage FIFO is shown in Figure 2.6.
It has a control circuit composed solely of interconnected JOINs and a data path of event-
controlled registers. The control signals are indicated by dashed lines. The thick arrows
show the direction of data flow. Data is implemented with single-rail encoding, and the data
path is as wide as the registers can accommodate. Adjacent stages of the FIFO communicate
through a two-phase, bundled-data signaling protocol. This means that a request arrives at
the next stage only when the data for that stage becomes valid. A bubble at the input of
a JOIN is a shorthand for a JOIN with an IWIRE on that input. It implies that, initially,
an event has already occurred on the input with the bubble, and the JOIN can produce an
output event immediately upon receiving an event on the other input.
Figure 2.6: A four-stage micropipeline FIFO structure
Initially, all control wires of the FIFO are at low voltage, and the data in the registers are
not valid. The FIFO is activated by a rising transition on RIn, which indicates that input
data is valid. Subsequently, the first-stage JOIN produces a rising output transition. This
signal is a request to the first-stage register to capture the data and become opaque. After
capturing the data, the register produces a rising transition on its cd output terminal. This
causes a transition on AIn and a transition on r1, which is a request to the second stage of
the FIFO. Meanwhile, the data has proceeded to the second-stage register and has arrived
there before the transition on r1 occurs. If the environment does not send any new data,
the first stage remains idle, and the data and the request signals propagate further to the
right. Notice that each time the data is captured by a stage, an acknowledgment is sent
back to the previous stage which causes its latch to become transparent again. When the
data has propagated to the last register, it is stored and a request signal ROut is forwarded
to the consumer of the FIFO. At this point, all control signals are at high voltage except
for AOut. If the data is not removed from the FIFO, that is, AOut remains low, the next
data coming from the producer advances only up to the third-stage register, because the
fourth-stage JOIN cannot produce an output. Finally, AOut also becomes high when the
consumer acknowledges receipt of the data. Further data storage and removal follows the
same pattern. The operation of each JOIN can be interpreted as follows. If the previous
stage has sent a request for data capture and the present stage is empty, then send a signal
to capture the data in the present stage.
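The filling of the FIFO described above can be reproduced with a small level-based simulation of the control chain. The sketch below is an abstraction, not the circuit itself: the registers are treated as delay-free, so the cd output of a stage is taken to equal the output of its JOIN, the JOINs are evaluated in a simple sequential sweep, and the names settle and stored_items are hypothetical.

```python
def settle(c, r_in, a_out):
    """Let the JOIN outputs of the control chain settle.

    c[i] is the output level of the JOIN of stage i. Each JOIN sees the
    request from the previous stage and the inverted (bubbled) acknowledgment
    from the next stage, and copies the request when the two agree.
    """
    c = list(c)
    changed = True
    while changed:
        changed = False
        for i in range(len(c)):
            request = r_in if i == 0 else c[i - 1]
            ack = a_out if i == len(c) - 1 else c[i + 1]
            if request == 1 - ack and c[i] != request:
                c[i] = request
                changed = True
    return c


def stored_items(c, a_out):
    """A stage holds data when its JOIN output differs from the next level."""
    levels = list(c) + [a_out]
    return sum(levels[i] != levels[i + 1] for i in range(len(c)))


# The producer keeps sending while the consumer never acknowledges (AOut low).
c, r_in, a_out = [0, 0, 0, 0], 0, 0
history = []
for _ in range(6):
    if c[0] == r_in:      # the previous request has been acknowledged
        r_in ^= 1         # issue a new request event
    c = settle(c, r_in, a_out)
    history.append(list(c))
```

After the first item all four JOIN outputs are high, after the second item the first three have returned low while the fourth stays high, and after four items no further request is accepted, in line with the description above.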
Figure 2.7: A general four-stage micropipeline structure
The FIFO can be modified easily to include data processing. A four-stage micropipeline,
in its general form, is illustrated in Figure 2.7. Now the data path consists of alternately
positioned event-driven registers and combinational logic circuits. The event-driven registers
store the input and output data of the combinational circuits, and the combinational
circuits perform the necessary data processing. To satisfy the data bundling constraint, delay
elements may occasionally be required to slow down the propagation of the request signals.
A delay element must at least match the delay through its corresponding combinational
logic circuit, either by some completion detection mechanism or through the insertion of a
worst-case delay.
A micropipeline FIFO is flexible in the number of data items it buffers. There is no
restriction on the rate at which data enters or exits the micropipeline, except for the delays
imposed by the circuit elements. That is why this FIFO, and micropipelines generally, are
termed elastic. In contrast, in an ordinary synchronous pipeline, the rates at which data enter
and exit the pipeline are the same, dictated by the external clock signal. A micropipeline
is also flexible in the amount of energy it dissipates, which is proportional to the number
of data movements. A clocked pipeline, however, continuously dissipates energy as if all
stages of the pipeline capture and pass data all the time. Another attractive feature of a
micropipeline is that it automatically shuts off when there is no activity. A clocked pipeline,
on the other hand, requires a special clock management mechanism to implement this feature.
This sensing mechanism, however, constantly consumes energy, because it should never go
idle.
2.5 Concluding Remarks 
We have touched only on a few topics relevant to the area of asynchronous circuits and
omitted many others. Among the topics omitted are the important areas of verification,
testing, and performance analysis of asynchronous circuits. We hope, however, that within
the scope of these pages we have provided enough information for further readings. For more
information on asynchronous circuits, please see [27], [IO], or [39]. A comprehensive
bibliography of asynchronous circuits can be found in [IO]. Up-to-date information on research in
asynchronous circuit design can be found at [38].
Chapter 3 
A Formulation for 
Quick Evaluation and Optimization of
CMOS Logic Circuits 
From a digital circuit designer's point of view, the design parameters are often limited
to the widths of the transistors. The supply voltage is usually fixed by the management
and the threshold voltage is dictated by the technology. Just recently, altering the supply
voltage within the chip and multi-threshold voltage technologies for saving energy are being
practiced. But the former produces overhead and the latter is costly. Although literature is
abundant with models for the performance of digital CMOS gates (e.g. [1,31,40,49,76,90]),
they are often too complicated for quick hand-analysis and optimization. Most of these
models express the delay in terms of a load capacitance, from which the effect of changing
the sizes of the transistors is not immediately clear.
The purpose of this chapter¹ is to introduce a first-order model for the delay and energy of
CMOS gates explicitly expressed in terms of the widths of the transistors. We demonstrate
the convenience of using this model and report some important results obtained through
direct applications of the model. We present simple expressions for optimum transistor
sizing for minimizing the rising delay, falling delay, total delay, and energy-delay product of
standard CMOS logic circuits. To keep the model simple, we ignore some of the secondary
effects, such as the contribution of finite signal slopes to delay. We also use a very simple
model for MOS capacitances. Later chapters present a more accurate and comprehensive
formulation. Considering that in practice the input signal to a gate is the output of the
preceding gate and, in turn, the output of a gate is the input to the succeeding gate, the
formulation and the parameters are derived for a set of three inverting buffers: a driver,
a cell, and a load. In a separate section, we discuss the applications of the model to the
classical problem of driving large loads, and generation of complementary signals.
¹ The content of this chapter appears in [85].
3.1 Single Stage CMOS Inverter 
The parasitic capacitances of a circuit characterise its delay and energy consumption. In an 
MOS transistor, there are two main capacitances: the gate capacitance, and the source and 
drain diffusion capacitances. Figure 3.1 shows the layout of an MOS transistor. 
Figure 3.1: Layout of a single MOS transistor.
The gate capacitance can be expressed as
$$C_g = A_g C_{ox} = W L C_{ox}$$
where A_g is the gate area, C_ox is the gate oxide capacitance per unit area, L is the effective
gate length, and W is the effective gate width of the transistor. The diffusion capacitance
at the source or drain can be expressed as
$$C_d = W Z C_{ja} + 2 (W + Z) C_{jp}$$
where Z is the diffusion length, C_ja is the average diffusion junction capacitance per unit area,
and C_jp is the average diffusion periphery capacitance per unit length. For convenience, we
assume that the area and periphery junction capacitances for NMOS and PMOS transistors
have the same corresponding values. The above expression shows that a diffusion capacitance
has two components: one which linearly grows with W, and another one which is constant
independent of W. We have ignored the gate to drain overlap capacitances and the Miller
effect. If included, however, overlap capacitances grow linearly with W.
Figure 3.2 illustrates the schematic of a CMOS inverter driving a total capacitive load of
C which includes all the gate and diffusion capacitances at the output node. The delay of
discharging C from VDD to VDD/2 through the NMOS transistor of width Wn is given by
Figure 3.2: Schematic of a CMOS inverter driving a capacitance C. 
where In is the average value of the current during the discharging period. Assuming a linear
relation with Wn, we can write
where the constant Kh is a function of VDD, the threshold voltage VTN, and technology
parameters. The capacitance at the output could be split into two components. One
component Cg includes all the gate capacitances at the output node, while the other component
Cd includes all the diffusion capacitances at the output node. Thus,
where Wg is the total width of the gates at the output node, Wd is the total width of the
diffusions at the output node including Wn and Wp, and N is the number of diffusions at
the output node. The above equation can be simplified as the following by introducing
parameters KD, K'D, and K''D.
A similar equation can be written for the charging delay through the PMOS transistor of
width Wp. If p represents the ratio of the saturated currents in an NMOS transistor to that
of a PMOS transistor of the same width, then
Analytically, p is given by
where µn and µp are the effective electron and hole mobilities, and α equals 2 for long
channel devices and around 1.25 for short channel devices [76]. Alternatively, p is defined as
the width ratio of the PMOS transistor to NMOS transistor at which the rising and falling
delays are equal. Using this definition, p can be found by simulating a chain of inverters.
Figure 3.3 shows that for our technology at VDD = 3 V, p = 2.2. Note that the rise time and
fall time (i.e., 10% to 90% of output signal value) are not equal at this value, as illustrated
in the figure.
The dynamic energy required to charge and discharge the capacitance C is approximately
given by
which can be simplified as
Here, we have ignored the effect of short-circuit current.
Figure 3.3: Simulation results for a chain of five inverters to extract the value of p.
(Horizontal axis: width ratio of PMOS to NMOS.)
3.2 Cascaded CMOS Gates 
The delay of a CMOS gate is characterized by its driving gate and the gates it drives. To
simplify design and analysis, a CMOS gate can usually be represented by an equivalent
CMOS inverter [66]. In Figure 3.4, a CMOS gate, its driving gate, and its load have all been
modeled as inverters. The ratio of PMOS to NMOS transistors in the inverters is represented
by r. We are assuming a uniform r for all three gates. This restriction is removed in later
chapters.
3.2.1 Total Delay 
The total delay, D, defined here as the sum of a rising transition delay and a falling transition
delay between the input of the drive and the output of the cell, is given by
Figure 3.4: A CMOS driver, gate (cell), and load represented by CMOS inverters
where the delay due to the gate capacitances is denoted by
and the delay due to the diffusion capacitances is denoted by
Parameters δ, δ', and δ'' are constants defined by the above equations. For further
simplification, the second term in Dd can be neglected, since it is relatively small for typical values of
W and WD, and it also decreases as W and WD increase. Therefore, as a first approximation
So, the delay consists of two terms: a gate-delay term, which depends on the size of the
transistors, and a diffusion-delay term, which depends solely on the number of the diffusions.
The larger the driver and the smaller the load, the smaller is the delay. If WD » W and
WL « W, then D is reduced to its minimal value and approaches Dd = 2δ'. For fixed WD
and WL, the optimum value of W obtained by solving ∂D/∂W = 0 is expressed by
$$W = \sqrt{W_D W_L} \qquad (3.10)$$
This is an important result for optimization of CMOS digital circuits. It is interesting to
note that the above expression remains valid even if the driver is driving some other gates
in addition to the cell. Using the above formula it can be easily shown that the maximum
frequency in a chain of inverters is achieved if all inverters are of the same size. Figure 3.5
shows the variations of D as a function of W obtained by HSPICE simulations of the circuit
in Figure 3.4 with r = 2. The driver size WD is fixed at 5 µm and WL is changed from 5 µm
to 100 µm. The optimum values of W as obtained from Equation 3.10 are indicated on each
curve.









Figure 3.5: Variations of delay as a function of the size of the cell for a fixed driver size
(WD = 5 µm) and various load sizes. Minimums obtained by the model using Equation 3.10
are indicated on each curve. (Horizontal axis: width of NMOS device, W, in microns.)
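Equation 3.10 can also be checked numerically. The sketch below assumes the first-order form D = δ(W/WD + WL/W) + 2δ', which is consistent with the limiting value 2δ' quoted above for a large driver and a small load but is written out here as an assumption. The values δ = δ' = 1 are placeholders, not the parameters extracted for this technology, and the function names are hypothetical.

```python
import math


def delay(w, w_d, w_l, delta=1.0, delta_p=1.0):
    """Assumed first-order total delay: gate term plus constant diffusion term."""
    return delta * (w / w_d + w_l / w) + 2.0 * delta_p


def optimum_width(w_d, w_l):
    """Closed-form optimum cell width, W = sqrt(WD * WL)."""
    return math.sqrt(w_d * w_l)


w_d = 5.0                                        # driver width in microns
grid = [0.01 * k for k in range(100, 5001)]      # candidate W from 1 to 50 microns
results = {}
for w_l in (5.0, 20.0, 50.0, 100.0):             # load widths in microns
    w_grid = min(grid, key=lambda w: delay(w, w_d, w_l))
    results[w_l] = (optimum_width(w_d, w_l), w_grid)
```

For each load the brute-force minimum on the grid coincides with the closed-form width to within the grid spacing. Because the diffusion term is constant in this assumed form, the placeholder values of δ and δ' do not affect the location of the minimum.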
The optimum ratio of PMOS to NMOS transistors for minimizing the delay can be found
from ∂D/∂r = 0, and is governed by
$$r = \sqrt{p}$$
independent of W, WD, and WL. In [47] the above has been derived under the assumption
that the total width of the PMOS and NMOS transistors Wp + Wn is kept fixed. We
emphasize that the above, as derived, is valid in general.
Using these values for W and r in (3.9), the minimum achievable delay for fixed WD and
WL is
3.2.2 Rising and Falling Delays 
In a similar fashion to the total delay, the rising and falling delays can be expressed as
and
respectively. The optimum W for minimizing the rising delay is, then, given by
and that for the falling delay by
In the same way, the optimum r for the rising delay is obtained by
$$r = \sqrt{p \, \frac{K_D (W_L / W) + K'_D}{K_D (W / W_D) + K'_D}}$$
and that of the falling delay is obtained by
Sometimes a designer may be interested in transistor sizing for equal rising and falling
delays. On the one hand, by definition the rising and falling delays are equal at r = p for
any W. On the other hand, if we equate the rising and falling delays and solve for W, the
result is
$$W = \sqrt{W_D W_L}$$
independent of r. This is the same expression as Equation 3.10, which minimizes the average
delay. Hence, if the number of cascaded gates in a path is n, then by setting
$$W_i = \sqrt{W_{i-1} W_{i+1}}, \quad 0 < i < n$$
$$r = \sqrt{p}$$
the designer minimizes the total delay while matching the rising and falling delays of the
signal along the path.
3.2.3 Energy and Area 
The total energy dissipation of the drive, cell, and the load in Figure 3.4 can be approximated
by
where η = KE(r + 1), η' = K'E(r + 1), and η'' = 2K''E. The last term is relatively small and
can be ignored.
Assuming that the total gate and diffusion area of a CMOS cell is a fair indication of its
total layout area, the area of a single NMOS transistor of width W can be expressed by
where KA is a constant. Hence, the area of the cell in Figure 3.4 (assuming an actual inverter)
is
where 8 is a constant. For our technology Z = 2.5 µm and KA = 5.8 µm.
3.2.4 Energy-Delay Product 
A parameter which is sometimes used as the optimization criterion in VLSI design is the 
energy-delay product denoted here by 
with q5 = 6(r, + 9') and q5' = 26(qt+  7'). 
Figure 3.6: Variations of optimal r with p for delay and energy-delay product . 
The optimum r for minimizing F, found from solving ∂F/∂r = 0, is given by
which ranges from 0.6 to 0.7 for the typical values of p between 2 and 3. The optimum
value of W for minimizing F is the root of the following polynomial
for which the closed form solution is rather lengthy. The variations of optimal r with p for
minimizing the delay and energy-delay product are illustrated in Figure 3.6.
Figure 3.7: Variations of the delay D, energy E, and energy-delay product F for a driver,
cell, and a load based on the formulation. (Horizontal axis: PMOS to NMOS width ratio.)
Figure 3.7 shows the variations of the delay, energy, and delay-energy product as a
function of the ratio of the PMOS to NMOS transistor width. The curves are obtained from
equations 3.9, 3.20, and 3.23 for WD = WL = W = 1; each curve is then normalized to its
value at r = 1. One realizes that while D has a broad minimum from r = 1 to r = 2.5, F
has a relatively narrow minimum from r = 0.5 to r = 0.8. Figure 3.8 shows the variations
of F as a function of W using simulations of the circuit in Figure 3.4 with r = 2. WD is
fixed at 5 µm and WL is changed from 5 µm to 100 µm. The minimums as obtained from
Equation 3.25 are indicated on each graph; good agreement with simulation is observed.
Figure 3.8: Variations of energy-delay product as a function of the size of the cell for a fixed
driver size and various load sizes. Minimums obtained by the model using Equation 3.25 are
indicated on each curve. (Horizontal axis: width of NMOS device, W, in microns.)
3.3 Some Applications and Observations 
3.3.1 Inverter Chain 
The period of oscillation of a chain of M inverters, where M is an odd integer, can be
expressed by
independent of the size of the inverters. Figure 3.9 shows the variations of P, E, and F as a
function of r for a chain of five inverters. The results are obtained both through the model
and through HSPICE simulations; good agreement between the two is observed. The period
curve is almost flat at the minimal value of about 2.2 ns for a wide range of r. While the
maximum variation of the delay in the flat region is less than 1.4%, the energy variation
is 45% in the same region. Hence, by choosing a value of r close to the lower end of the
flat region, the energy saving is huge while the delay cost is negligible. Of course, this uui

3.3.2 Tapered Buffer Design
Figure 3.10: Tapered buffer circuit for driving large load. 
Figure 3.10 illustrates the classical problem of driving a large load by a group of cascaded
buffers. The size of the first stage buffer is known. The load is also known and represented
by a buffer. We are interested in finding the optimal number of stages n and the size of each
stage to minimize the total delay. Assume that all of the buffers have the same r. Usually
it is also assumed that the buffers decrease in size from Wn to W0 by a constant tapering
factor β and, hence, the name tapered buffer chain [53]. Here, however, we don't make such
an assumption and start by applying Equation 3.10. To minimize the delay, each stage i,
where 0 < i < n, must satisfy $W_i = \sqrt{W_{i-1} W_{i+1}}$, which can be rewritten as
$$\frac{W_{i+1}}{W_i} = \frac{W_i}{W_{i-1}} = \beta$$
Thus, the constant tapering factor β appears naturally as a necessary condition rather than
an assumption. In [89], this condition is proven using the fact that the arithmetic average
of N positive integers is always larger than their geometric average. The total delay can be
expressed as
Solving ∂D/∂n = 0 for optimum n yields
where Ω(x) is the Omega or Lambert W function [24] satisfying
$$\Omega(x) \, e^{\Omega(x)} = x$$
The Omega function is included in some computational tools such as Maple. The optimum
β as obtained by substituting 3.29 into 3.27 is given by
This result is, in fact, equivalent to those obtained by authors who included the effect of
the diffusion capacitances of the buffers, e.g. [53], although none of them has mentioned the
relation to the Omega function. Figure 3.11 (bottom) shows a plot of β as a function of δ'/δ.
Figure 3.11: The Omega function Ω(x) (top) and the optimum tapering factor β as a function
of the buffer's intrinsic to output capacitance δ'/δ (bottom). (Horizontal axis: ratio of
buffer's output to input capacitance.)
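The relation between the tapering factor and the Omega function can be illustrated numerically. The sketch below does not reproduce the equations of this section. It assumes the classical optimum-taper condition β(ln β − 1) = δ'/δ for buffers with intrinsic (diffusion) capacitance, which can be rewritten as β = exp(1 + Ω(δ'/(eδ))). The Newton iteration used to evaluate Ω and the function names omega and optimum_taper are likewise assumptions made for illustration.

```python
import math


def omega(x, tol=1e-12):
    """Principal branch of the Omega (Lambert W) function for x >= 0,
    i.e. the w solving w * exp(w) = x, found by Newton's iteration."""
    w = math.log1p(x)                  # starting guess, exact at x = 0
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def optimum_taper(ratio):
    """Tapering factor beta solving beta * (ln(beta) - 1) = ratio (assumed form)."""
    return math.exp(1.0 + omega(ratio / math.e))


# Tapering factor for several assumed ratios of intrinsic to input capacitance.
betas = {ratio: optimum_taper(ratio) for ratio in (0.0, 0.5, 1.0, 2.0)}
residuals = {ratio: b * (math.log(b) - 1.0) - ratio for ratio, b in betas.items()}
```

With a ratio of zero, that is, with the diffusion capacitance neglected, the sketch returns β = e, the familiar textbook value. Under the assumed condition the tapering factor grows as the intrinsic capacitance grows.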
3.3.3 Generation of Complementary Signals
Sometimes a designer may want to generate a pair of complementary control signals, for
instance, two phases of a clock signal. In such cases it is important that both signals arrive at
the load almost simultaneously. With reference to Figure 3.12, assume that using the signal
at the input of the driving buffer of size WD, we want to produce a pair of corresponding
complementary signals at a pair of equal loads represented by two buffers of size WL each.
The values of WD and WL are known.
Figure 3.12: Circuit for producing complementary signals.
Let us denote the delay through the upper path Du and the delay through the lower path
Dl, both of which are obtained using Equation 3.9.
From the previous arguments we know that Dl is minimized if
$$\frac{W_1}{W_D} = \frac{W_2}{W_1} = \frac{W_L}{W_2}$$
from which we obtain W1 and W2. Next, we solve Dl - Du = 0 for W, which results in
For example, if WL = 20 µm and WD = 5 µm, the above formulas give W = 4.51 µm,
W1 = 7.93 µm, and W2 = 12.60 µm. HSPICE simulation shows that with these values, Du
and Dl are apart by less than 10%. Note that this is only a first-order model and the fine
tuning is still left to the schematic and layout simulations.
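The quoted sizes of the two lower-path buffers can be checked by applying Equation 3.10 to each of them, which makes the widths along that path a geometric progression from WD to WL. The short sketch below does only that. The width W of the upper-path buffer depends on the delay constants and is not recomputed here, and the variable names are chosen for illustration.

```python
w_d, w_l = 5.0, 20.0                   # driver and load widths in microns

# Three stage-to-stage ratios span WD -> W1 -> W2 -> WL, so each ratio is the
# cube root of WL / WD.
beta = (w_l / w_d) ** (1.0 / 3.0)
w1 = w_d * beta
w2 = w1 * beta
```

The values obtained this way agree with the figures of about 7.93 µm and 12.60 µm quoted above, and each of them is the geometric mean of its two neighbours.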
3.4 Extracting Delay and Energy Parameters
A circuit simulator such as HSPICE can be used to extract the constants in the delay and
energy expressions. Alternatively, they can be calculated using the standard equations for
the capacitances and saturation currents. We have chosen the former method using a chain
of five inverters. Although using a chain of three inverters seems equally appropriate and
simpler, the parameters obtained in this way may be less accurate due to the possibility
of the partial-swing behaviour of the signals. For VDD = 3 V with the 0.8 µm BiCMOS
technology we have used, the values of the constants are listed in Table 3.1. The values of
δ, δ', and δ'' can be calculated using this table and Equations (3.7) and (3.8). Similarly, the
values of η, η', and η'' can be calculated using the table and Equation (3.20).
Table 3.1: Delay and Energy Parameters Extracted for a 0.8 µm BiCMOS Technology at
VDD = 3 V and r = 2
Delay: KD = 0.0256 ns
Energy: KE = 0.0176 pJ/µm
The parameters can be obtained through the following three-step simulations:
• Simulate the inverter chain using large transistors, e.g. W = 100 µm, so that the
effects of δ'' and η'' can be neglected. Measure the period of oscillation D1 = 5δ + 5δ'
and the energy E1 = 5Wη + 5Wη'.
• Insert an inverter load of size W/5 at each node of the chain, as shown in Figure 3.13
with the dashed lines, and repeat the simulation. Measure the period of oscillation
D2 = 6δ + 5δ' and the energy E2 = 6Wη + 5Wη'. Calculate the values of δ, η, δ', and
η'.
• Remove the loads and simulate the chain with small transistors, e.g. W = 2 µm, to
capture the effects of δ'' and η''. Measure the period of oscillation D3 = 5δ + 5δ' +
(10/W)δ'' and the energy E3 = 5Wη + 5Wη' + 5η''. Calculate the values of δ'' and η''.
Figure 3.13: The circuit used to extract the delay and energy parameters through simulation
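The three measurement steps reduce to a small system of linear equations, which the following sketch solves in the order given above. The function name, argument names, and the constants in the round-trip check are hypothetical. The constants are made-up numbers used only to confirm that the algebra inverts the measurement expressions, and they are not the values of Table 3.1.

```python
def extract_parameters(w_large, d1, e1, d2, e2, w_small, d3, e3):
    """Solve the three measurement pairs for the delay and energy constants."""
    # Step 2 minus step 1 isolates the gate terms.
    delta = d2 - d1
    eta = (e2 - e1) / w_large
    # Step 1 then yields the diffusion terms.
    delta_p = d1 / 5.0 - delta
    eta_p = e1 / (5.0 * w_large) - eta
    # Step 3 exposes the width-independent terms.
    delta_pp = (d3 - 5.0 * (delta + delta_p)) * w_small / 10.0
    eta_pp = (e3 - 5.0 * w_small * (eta + eta_p)) / 5.0
    return delta, delta_p, delta_pp, eta, eta_p, eta_pp


# Round-trip check with made-up constants (hypothetical, for illustration only).
assumed = dict(delta=0.20, delta_p=0.25, delta_pp=0.05,
               eta=0.05, eta_p=0.03, eta_pp=0.10)
w_large, w_small = 100.0, 2.0
d1 = 5 * assumed["delta"] + 5 * assumed["delta_p"]
e1 = 5 * w_large * (assumed["eta"] + assumed["eta_p"])
d2 = 6 * assumed["delta"] + 5 * assumed["delta_p"]
e2 = 6 * w_large * assumed["eta"] + 5 * w_large * assumed["eta_p"]
d3 = d1 + (10.0 / w_small) * assumed["delta_pp"]
e3 = 5 * w_small * (assumed["eta"] + assumed["eta_p"]) + 5 * assumed["eta_pp"]
recovered = extract_parameters(w_large, d1, e1, d2, e2, w_small, d3, e3)
```

In practice d1, d2, d3 and e1, e2, e3 would come from the simulator. The synthetic values here only show that the six constants are recovered from the six measurements.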
3.5 Concluding Remarks 
In this chapter, we have introduced a formulation for the delay of a CMOS gate based on the
widths of the transistors of the gate itself, the preceding gate (driver), and the succeeding
gate (load) in a path. We argued that in most cases the widths of the transistors are the
only parameters a designer can adjust and, hence, it is important to have an adequate
model explicitly expressing the change in performance as a function of the widths of the
transistors. This model is intended for fast evaluation of circuit performance, providing
an appropriate initial guess for starting a circuit simulator or optimization tool, and giving
insight into simulation results. The model is kept simple by ignoring some second-order effects.
These effects were added later to improve the accuracy and suitability for CAD tools. Using
this model, we have derived expressions for optimal transistor sizing for minimizing the total
(average) delay, rising delay, falling delay, and energy-delay product. To demonstrate the
convenience of using the model, we applied it to the classical problem of driving large loads
and designing a circuit for generation of complementary signals.
Chapter 4 
Modeling and Comparing 
Single-Rail CMOS Implementations
of the C-Element 
Various applications have demonstrated that asynchronous circuits have grest potential for 
low-power and high-performance design. One of the most primitives frequently used in 
asynchronous control circuits is the C-element (921. For instance, the C-dement plays an 
important role in the asynchronous, micropipeline-based version of the ARM microproces- 
sor, AMULET [37]. In this chapter1 we compare and give optimizations of the most popdar 
CMOS implementations of the Celement using the formulation of Chapter 3. This chapter 
deals only with non-Merential impiementations of the C-element, which ail happened to 
be of the singlerail type. Hence, we also r e k  to them as the singlerail implementations 
of the Celement. A study on differential (double-rail) implementations of the Celement 
appears in the next chapter. Sometimes the single-rail implementations seem t o  be favoared 
over the differentid implement ations, because t hey reduce the wiring requirements and elim- 
inate the potential problems associated with the difference in the propagation delays of the 
compiementary signals. 
The CMOS implementations stndied here differ in topology for maintaining the state 
'The content of tbis chapter has dso appaed in [al] and [83]. 
58 
CHAPTER 4. SINGLERAIL CMOS IMPLEMENTATIONS OF THE C-GLEMENT 59 
of the output when the gate is idle. One single-rail, conventional implementation of the
C-element was introduced by Sutherland [92] and is often used in high-performance
micropipelines; a second implementation was introduced by Martin and has been used
in the Caltech asynchronous microprocessor [57] and in an asynchronous, low-power version
of the ARM developed at Manchester University [37]; a third implementation was
introduced by Van Berkel and is used in the TANGRAM silicon compiler for low-power
design at Philips Research Laboratories [5,6]. The implementations are evaluated
with respect to delay, energy consumption, and silicon area using analytical approximations
and HSPICE simulations. Simulations are performed for two typical test environments: one
to measure the individual performance of a particular implementation and to evaluate the
effect of its input capacitance and driving ability, and another to measure the group
performance of a series of cross-coupled C-elements, where the performances of the individual
elements are mutually dependent.
First, we introduce the C-element and its various single-rail implementations. Then, the
first-order model of Chapter 3 is applied to each implementation, followed by a comparative
analysis. Next, the two test environments and the significance of each are described. For each
environment, a comparative study is performed using HSPICE simulations. The simulations
support the results deduced from the first-order analysis. Finally, we conclude by indicating
the most suitable single-rail CMOS implementation for an energy-efficient, high-speed design
and by giving an intuitive justification of the results.
4.1 The C-element 
The C-element was introduced by D. E. Muller [65] and is therefore also called the
"Muller C-element." A C-element has two inputs a and b and one output c. Traditionally,
its logical behaviour has been described as follows. If both inputs are 0 (1) then the output
becomes 0 (1); otherwise the output remains the same. For the proper operation of the
C-element, it is also assumed that once both inputs become 0 (1), they will not change until
the output changes. A state diagram is given in Figure 4.1 along with a commonly used
schematic.
Figure 4.1: State diagram and schematic of the C-element.
The behaviour of the output, c, of a C-element can be expressed in terms of the inputs a
and b and the previous state of the output, ĉ, by the following Boolean function

$$c = ab + \hat{c}\,(a + b)$$

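The logical behaviour described above can be sketched as a next-state function. This is a minimal behavioural model, not a circuit description; the names `c_element` and `run` are illustrative assumptions and do not come from the thesis. It uses the majority form of the function, which agrees with the verbal description: the output follows the inputs when they agree and holds its previous value otherwise.

```python
def c_element(a: int, b: int, c_prev: int) -> int:
    """Next output of a Muller C-element (behavioural sketch).

    Majority form: c = a*b + c_prev*(a + b).
    """
    return (a & b) | (c_prev & (a | b))


def run(inputs, c0=0):
    """Apply a sequence of (a, b) input pairs and return the output trace."""
    trace, c = [], c0
    for a, b in inputs:
        c = c_element(a, b, c)
        trace.append(c)
    return trace


if __name__ == "__main__":
    # a rises, then b rises, then a falls, then b falls.
    # The output changes only once both inputs have changed.
    print(run([(1, 0), (1, 1), (0, 1), (0, 0)]))
```

The input sequence in the example respects the JOIN-style restriction that no input changes twice before the output responds.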
The C-element is closely related to the JOIN [34], or RENDEZVOUS [XI]. The JOIN has a
slightly more restrictive environment behaviour in the sense that an input is not allowed to
change twice in succession. So in the state graph, the bidirectional arcs should be replaced
by unidirectional arcs. One could also say that the JOIN performs the AND operation of
two events, where an event could be either a rising or a falling voltage transition. Thus,
a JOIN produces an output transition only after both inputs have received a transition. In
asynchronous circuits, a C-element is most often operated as if it were a JOIN. In fact, any
C-element implementation can be used to realize a JOIN. In this chapter, we assume that
the environment behaviour satisfies the restrictions for the JOIN environment.
4.2 Dynamic C-element 
The single-rail dynamic C-element of Figure 4.2 constitutes the basic functional part of a
static single-rail implementation of the C-element. The static versions thus differ only in
their mechanisms for preserving the state of the output. Since these mechanisms insert additional
parasitics, we expect the dynamic C-element to be faster and less energy consuming than
the static versions. The design parameters in this circuit are the main body size W, the
output inverter size U, and the P-to-N transistor width ratio r.
Figure 4.2: Dynamic implementation of the C-element.
In order to derive the delay of the dynamic C-element, assume the general environment
of Figure 3.4, such that the C-element is driving an inverter load of WL (i.e. NMOS width
is WL and PMOS width is rWL) at the output node c and Wt at node c'. The delay, hence,
is governed by
which has been intentionally written out in some detail so that the reader can verify it. The
energy and area are simply as follows
We compare the delay, energy, and area of the dynamic C-element to the values for the static
implementations to measure the deviations from an ideal case. Also, we use the dynamic
C-element to find the optimal design parameters in a given environment and use the same
values for the static C-element implementations.
4.3 Standard Implementation of the C-element 
The standard pull-up pull-down static realization of the C-element is shown in Figure 4.3.
This circuit was presented by Sutherland in [92]. This implementation is ratioless,
i.e. it does not impose any restrictions on the sizes of the transistors. From the operation
of the circuit, we conclude that N1, N2, and N6 are the main pull-down transistors which
contribute to output switching; they are of size W, W, and U, respectively. N3,
N4, and N5, in contrast, only provide the necessary feed-back to hold the state of the output
when the values of the inputs do not match; hence, they are made as small as possible to reduce
their loading effect. Similarly, the feed-back transistors P3, P4, and P5 have the minimum
width w, while P1, P2, and P6, the normal pull-up transistors, have widths rW, rW, and rU,
respectively.
Figure 4.3: Standard implementation of the C-element [92].
The delay of the standard implementation can be expressed by 
2 w  2 w p  2 w  2 -+- +-+- 
+KD(WD W D r  [I VI') 
which has two extra terms compared to that of the dynamic implementation. The two terms
represent the cost of the output-state-holding mechanism and vanish as WD, W, and U are
increased such that w becomes negligible. The energy and area, as given below, also become
approximately equivalent to those of the dynamic implementation for large W and U.
4.4 Implementation of the C-element
with Weak Feedback 
The C-element implementation illustrated in Figure 4.4 was presented by Martin [55].
This circuit utilizes an inverter latch to maintain the state of the output when the inputs do
not have the same logic level. Hence, in this chapter it is referred to as the Weak Feedback
Implementation. The circuit suffers from a race problem at node c'.
There is an inherent resistance to switching the state of the latch that can be reduced,
but cannot be eliminated. For proper operation of the circuit, certain size ratios must be
imposed on the transistors. The feed-back inverter should be a weak one to allow changes in
the state of the latch. The race problem diminishes as the resistance of the feed-back inverter
transistors increases. This can be done by increasing the length of these transistors with
respect to their width. This solution, however, increases the load at the output and makes
node c' more susceptible to noise. Therefore, minimum-size transistors are chosen for this
inverter.
Figure 4.4: Implementation of the C-element with weak feedback inverter [55].
The delay of this circuit has extra components due to the race problem at node c'; it can
be expressed as
The first two terms represent the gate and diffusion delays for the input node a or b and the
output, except the output gate delay due to the feed-back inverter. The third term represents
the gate delay at node c' and the gate delay at the output due to the feed-back inverter. The
fourth term is the diffusion delay at node c'. When node c' is to switch from high to low, a
current proportional to 0.5W flows to ground through N1 and N2. This current, however,
is opposed by a current proportional to w/(2μ) which flows from P3 into c'. Note that if c
were a stable low, the current through P3 would be w/μ, and when c reaches a stable
high, the current through P3 becomes zero. Hence, here we take an average value
of the current to capture the effect of c being driven high at the same time as c' is falling.
The net current discharging c' is thus 0.5W - w/(2μ), which appears in the denominators
of the appropriate terms in the delay expression. Similarly, the net current charging c' is
0.5Wr/μ - w/2, as it appears in the delay expression. Hence, the delay expression demands
that W > w/μ for N1 and N2, and that W > μw/r for P1 and P2, for the output low-to-high
and high-to-low switchings to take place, respectively. Comparing the delay expression with
that of the dynamic implementation, we see that the third term has replaced 2δU/W and
the fourth term has replaced 2δ'. The third and fourth terms reduce to these
terms, respectively, if w = 0.
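The two net-current expressions and the resulting sizing constraints can be checked numerically. The sketch below is only an illustration of the expressions quoted in the text; the function names are hypothetical, the currents are in the same normalized width units the text uses, and the example values (w = 1.4 μm, μ = 3.08, r = 2.5) are borrowed from elsewhere in this chapter as plausible inputs rather than as the author's design point.

```python
def weak_feedback_drive(W, w, mu, r):
    """Net drive strengths at node c' of the weak feedback C-element.

    pull_down: 0.5*W - w/(2*mu)   main NMOS stack opposed by the feed-back PMOS
    pull_up:   0.5*W*r/mu - w/2   main PMOS stack opposed by the feed-back NMOS
    Both values must be positive for the node to switch.
    """
    pull_down = 0.5 * W - w / (2.0 * mu)
    pull_up = 0.5 * W * r / mu - w / 2.0
    return pull_down, pull_up


def min_main_width(w, mu, r):
    """Lower bound on W implied by W > w/mu and W > mu*w/r."""
    return max(w / mu, mu * w / r)


if __name__ == "__main__":
    w, mu, r = 1.4, 3.08, 2.5          # illustrative values only
    print("W must exceed", min_main_width(w, mu, r))
    for W in (1.0, 2.0, 10.0):
        print(W, weak_feedback_drive(W, w, mu, r))
```

With these inputs the pull-up constraint W > μw/r is the binding one, which reflects the weaker PMOS devices in the main stack.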
The energy expression for the weak feedback implementation, stated below, consists of
two parts: the useful energy, EU, spent on charging and discharging the nodes, and the
energy wasted in the race at c', EW.
The useful energy is the energy of the dynamic implementation plus the energy spent on the
feed-back inverter. In general, the wasted energy at a node with a total load capacitance C
can be approximated by
the wasted current, Dw is the duration (delay) of wasting energy, I is the
average net current flow during Dw, and Wg and Wd are the gate and diffusion components
of C. Applying this to node c', expression (4.10) is obtained. According to this expression,
the wasted energy decreases as W increases, because the duration of wasting energy gets
smaller.
The weak feedback implementation has a device count of 8 compared to the 12 of the
standard implementation. It also has a smaller area by 4wKn.
A modified version of the weak feedback implementation is shown in Figure 4.5. In this
circuit, the feed-back inverter has two additional transistors, N5 and P5. These transistors
are weak and skinny, acting as resistances to limit the current flow and reduce the race
problem during switching. We refer to this circuit as the Resistive Implementation of the
C-element. We have implemented N5 and P5 with a channel length of L' = 2 μm (instead of
the minimal 0.8 μm) and the minimum channel width w = 1.4 μm. A disadvantage of using these
skinny transistors is that node c' becomes more sensitive to noise.
Figure 4.5: Implementation of the C-element with resistive inverter at the feedback. 
By inserting N5 and P5, the pull-up and pull-down currents through the feed-back are
scaled by a factor of h given by
Therefore, the delay and energy expressions for the resistive implementation can be easily
obtained by modifying the corresponding expressions for the weak feedback implementation,
i.e., by replacing all the occurrences of w/2 by wh/2 and those of w/(2μ) by wh/(2μ). Since
h is less than unity, one can verify that DR is less than DF and that ER is less than EF.
The area of the resistive implementation, however, increases compared to the weak
feedback implementation by 4w(Z + L') as a penalty for the modification made.
4.5 Symmetric Implementation of the C-element 
The C-element implementation by Van Berkel [2] is shown in Figure 4.6. In this circuit, the
output state is maintained through a feed-back conducting path of three transistors in the
pull-up tree or the pull-down tree. If the input values are not equal and the output is low,
the conducting path consists of either P1, P3, and P5, or P4, P3, and P2. By symmetry,
when the output is high, a path of N1, N3, and N5, or N4, N3, and N2 is responsible
for latching the output. Two parallel paths of P-devices or N-devices contribute to output
switching. During output rising, P1 and P2 in parallel with P4 and P5 are conducting. During
output falling, N1 and N2 in parallel with N4 and N5 are conducting. Like the standard
implementation, this circuit is also ratioless. An advantage of this implementation is that it
is symmetric with respect to the inputs and, thus, we call it the Symmetric Implementation
of the C-element. For the circuit to have the same pull-up and pull-down resistances (when
switching) as the previous implementations, the normal N-tree and P-tree transistors, except
those of the output inverter, must be made half the size (i.e. W/2). The feed-back transistors
N3 and P3 are, as usual, of minimum size, and N6 and P6 have a normal size to achieve the
load driving capability of the previous circuits.
The delay, energy, and area of the symmetric implementation are governed by the
following expressions.
Figure 4.6: Symmetric implementation of the C-element [2].
The delay expression has one additional term compared to the dynamic implementation.
Interestingly, this extra term also appears in the delay expressions of the standard, weak
feedback, and resistive implementations. Hence, for the same value of W, the symmetric
implementation is expected to have the lowest delay among the static implementations.
The very same argument applies to the energy when compared to the energy
expressions of the other static implementations. The area of the symmetric implementation
equals that of the weak feedback implementation and is less than those of the resistive and
standard implementations. The symmetric implementation has a device count of 12, equal
to that of the standard implementation and higher than the weak feedback and resistive
implementations by 4 and 2, respectively.
Our analysis shows that the symmetric implementation is the best candidate for
energy-efficient, high-speed designs, as it gives the least delay per unit energy among the static
implementations. Our model also indicates that as W and U are increased, eventually, the
delay and energy of all of the static implementations coincide with the asymptotic delay and
energy of the dynamic implementation.
4.6 Effect of Arriving-Time Order of Inputs 
As mentioned before, the dynamic C-element structure is common to all static C-element
implementations considered here; hence, the following arguments are made in reference to
Figure 4.2. Let us name the internal node between the two NMOS transistors "N", and the
internal node between the two PMOS transistors "P". To produce an output, there are
two possible orders of input events: a arrives before b, or b arrives before a. Assume that,
initially, the inputs are low, so that c' and c are high and low, respectively. If a rises first,
node N is discharged; and when b rises, c' goes low with minimal body effect. If b rises first,
however, charge distribution takes place between nodes N and c'. In the dynamic C-element,
the charge lost in c' cannot be replaced, and the consequence is that when a rises, c' is
discharged relatively slower due to the higher body effect. The scenario is worse in the static
C-elements in the case of a arriving after b, because the charge lost in c' is gradually replaced
through the keepers and N can charge up to VDD - VTN. Hence, the total effective charge in
switching could increase by CN(VDD - VTN), where CN is the total capacitance at N. This, of
course, increases the delay and energy consumption. A similar argument applies to node P
in the process of discharging the output, if b falls before a does. Again, the effective charge
in switching could increase by CP(VDD - VTP), where CP is the total capacitance at P.
The extra delay per cycle, ignoring the body effect, can be approximated by
where v = (VDD - VT)/VDD. The extra energy per cycle can be expressed by
The effect of the order of input arrival on the keepers is usually negligible due to the small
size of their transistors.
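To give a feel for the magnitude of the arrival-order effect, the following sketch evaluates the extra switching charge C(VDD - VT) for an internal node and the ratio v = (VDD - VT)/VDD. It covers only those two quantities, which are stated in the text; the delay and energy expressions themselves are not reproduced. The function names and all numeric values (10 fF node capacitance, 3.0 V supply, 0.8 V threshold) are assumptions chosen for illustration.

```python
def extra_switching_charge(c_node, vdd, vt):
    """Upper bound on the extra charge when an internal node floats up to VDD - VT."""
    return c_node * (vdd - vt)


def swing_ratio(vdd, vt):
    """v = (VDD - VT) / VDD, the normalized internal-node swing."""
    return (vdd - vt) / vdd


if __name__ == "__main__":
    c_n = 10e-15   # hypothetical capacitance at node N, in farads
    vdd = 3.0      # assumed supply voltage, in volts
    vtn = 0.8      # hypothetical NMOS threshold voltage, in volts
    print("extra charge (C):", extra_switching_charge(c_n, vdd, vtn))
    print("v:", swing_ratio(vdd, vtn))
```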
4.7 First Test Environment 
Figure 4.7 illustrates the first measurement setup. The C-element is driven by two inverters
which are fed by the ideal sources Va and Vb. At the output, the C-element is driving a
number of inverters similar to the input inverters in size. The NMOS and PMOS transistors
of the inverters are 4 and 10 microns wide, respectively.
Figure 4.7: First measurement setup: testing for optimal sizing for known fan-out.
The input inverters and the C-element gate are supplied with the voltage source VDD,
whereas the fan-out inverters have a separate power supply, Vm. This is to make sure that,
when measuring the energy consumption, the effect of the input capacitance of the C-element
and its own internal parasitics are accounted for. The typical testing input waveforms and
the expected output waveform are also depicted in Figure 4.7. The delay through the
C-element depends on the order and arriving times of the input signals. Exhaustive simulations
are needed to find the worst-case delay by altering the arriving interval between the input
signals. Simulations have shown that in the case of the C-element, the worst-case delay is
obtained when one input arrives shortly (less than 1 ns) after the other. A slightly less
accurate, but faster technique is to be concerned only with the arriving order of the input
signals. As shown in the figure, using this method, six cases are considered and the respective
delays produced are measured: Va rising before Vb gives Tf1, Va falling before Vb gives Tr1, Vb
rising before Va gives Tf2, Vb falling before Va gives Tr2, Va and Vb simultaneously rising gives Tf3,
and simultaneously falling gives Tr3. Accordingly, the worst-case fall time Df, the worst-case
rise time Dr, and the overall worst-case delay Dw are obtained by the following.
In cases where one input changes before the other, the time difference between them is 5 ns.
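The reduction of the six measured delays to worst-case figures can be sketched as follows. This assumes that each worst-case value is simply the maximum over the corresponding measurements, which is the natural reading of "worst-case" here but is an assumption, since the defining equations are not reproduced in this text. The function name and the sample delay values are hypothetical.

```python
def worst_case_delays(t_fall, t_rise):
    """Reduce per-case delays to (Df, Dr, Dw).

    t_fall: the three output-fall delays (Tf1, Tf2, Tf3)
    t_rise: the three output-rise delays (Tr1, Tr2, Tr3)
    Assumes worst case means the maximum over the measured cases.
    """
    d_f = max(t_fall)
    d_r = max(t_rise)
    d_w = max(d_f, d_r)
    return d_f, d_r, d_w


if __name__ == "__main__":
    # Hypothetical measurements in nanoseconds, not taken from the thesis.
    t_fall = (1.21, 1.30, 1.12)
    t_rise = (1.18, 1.34, 1.10)
    print(worst_case_delays(t_fall, t_rise))
```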
The energy consumption of the C-element per output transition also depends on the
arriving times and order of the inputs. We measure the average energy consumption per
output cycle E, corresponding to the illustrated waveforms, from the following.
where P(VDD) is the average power extracted from VDD, and T is the total period of the
output waveform. The SPICE simulations for this setup were performed at a frequency of
40 MHz.
Assume that we would like to design the C-elements for a standard cell library and test
them in the environment of Figure 4.7. We chose r = 2.5, close to the value of μ (i.e.
3.08 in our technology), to obtain almost equal rise and fall times. We need to decide upon
the optimal value of U. For this purpose, we use the dynamic implementation as a reference
point and take its delay equation as in (4.1).
This yields the following value for optimum U
We further assume that our optimal design targets a fan-out of two C-elements. Hence,
WL = 2W and Uopt = W. The only remaining design parameter is W, which now can be
obtained by assuming some value for WD and a target delay.
The simulations for this environment are performed for a range of W = 2 ... 30, U =
W, and a fan-out of three inverters. Each inverter has an NMOS transistor width of 4 μm
and a PMOS transistor width of 10 μm. Figure 4.8 shows the energy dissipation versus
the propagation delay for the C-element implementations. The size of the C-element gate
increases for each curve from the right hand side of the graph toward the top of the graph.
Thus, one might get two different energy readings for the same delay and implementation,
but they correspond to two different sizes.
[Figure 4.8 plot: HSPICE simulation, fan-out of 3 inverters; x-axis: worst-case propagation delay (ns), 1.15 to 1.5.]
Figure 4.8: SPICE simulation results, energy versus delay, for the C-element gates in the
first test setup.
Point "X" in the figure shows the minimum delay achievable using the weak feedback
implementation of the C-element. This delay is around 1.34 ns, costing 6.2 pJ in energy
and 100 μm² in area. For the same delay, if we chose the standard implementation (see
the cross point between the dotted vertical line and the standard implementation curve), we
would spend 3.6 pJ and occupy only 50 μm², a saving of 42% in energy and 50% in area.
Furthermore, if we chose the symmetric implementation, it would consume only 2.6 pJ and
occupy 30 μm², a saving of 58% in energy and 70% in area over the weak feedback circuit.
Similarly, point "Y", D = 1.27 ns and E = 5.2 pJ, shows the minimum obtainable delay using
the standard implementation with an area of 90 μm². The symmetric circuit obtains the same
delay for 2.85 pJ and 35 μm², saving 45% and 61% in energy and area, respectively. Notice
also that delays between 1.15 ns and 1.27 ns can be achieved by the symmetric circuit but not
by the other static implementations. The performance of the resistive implementation (with
L' = 2.0 μm) is very similar to that of the standard implementation.
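The quoted savings follow directly from the energy and area readings at points "X" and "Y". The short check below recomputes them; the helper name `saving` is an illustrative assumption, and the numbers are the ones stated in the paragraph above.

```python
def saving(reference, alternative):
    """Percentage saved by the alternative relative to the reference."""
    return 100.0 * (reference - alternative) / reference


if __name__ == "__main__":
    # Point X: weak feedback at 1.34 ns (6.2 pJ, 100 um^2) as the reference.
    print("standard  vs weak:", saving(6.2, 3.6), saving(100, 50))
    print("symmetric vs weak:", saving(6.2, 2.6), saving(100, 30))
    # Point Y: standard at 1.27 ns (5.2 pJ, 90 um^2) as the reference.
    print("symmetric vs standard:", saving(5.2, 2.85), saving(90, 35))
```

Rounded to the nearest percent, the results reproduce the 42%/50%, 58%/70%, and 45%/61% figures given in the text.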
Energy-delay (ED) graphs similar to that of Figure 4.8 are convenient for comparing
alternatives. In such graphs, the outermost curve closest to the origin represents the best
choice. The graph demonstrates that the first test is very much in favour of the symmetric
implementation of the C-element. Although the minimum delays of the circuits are all
within a 10% difference, the symmetric implementation achieves a significant energy saving
of about 50% for the same delay. Drawing the ED graphs for other fan-out values, it is
observed that the energy savings of the symmetric implementation for the same delay further
increase (with respect to the other implementations) as the fan-out decreases, since energy
consumption due to the capacitive load becomes a less significant portion of the total energy
consumption. The same is true for energy savings using the standard implementation over
the weak feedback implementation.
4.8 Comparing the Model with the Simulation Results
Figure 4.9 shows the Energy-Delay graph of the different C-element implementations based
on the model. The calculations include (1/3)Eo, since Eo is only relevant in two out of six
input transition cases. For example, for the symmetric implementation at W = 10 μm, Eo
is over 17% of the total energy per cycle. The delay D calculated by the formula for each
implementation is actually the sum of the rising and falling delays. In Figure 4.9 we have
plotted D/2, which is the average of these delays; thus we have a way of comparing the
results obtained from the formulas with the results obtained from the simulations. Notice
that the calculated delay D does not distinguish rising and falling delays, nor does it include
Do, the extra delay caused by a difference in arrival time of the inputs. The generated graph
is in good agreement with the results obtained by simulations shown in Figure 4.8. Firstly,
we observe that the general shape of the curves matches that of the simulations. Secondly,
the relative order and position of the curves correspond to those of the simulations; even the
point where the resistive implementation curve crosses over the standard implementation
curve has been captured by the model. Some disagreements in the values of the energies and
delays arise from the fact that the simulation results are plotted for the "worst-case" delays.
More complicated phenomena affecting the delay are not included in our model, such as the
body effect and the impact of the initial conditions at the internal nodes.
[Figure 4.9 plot: calculated using the delay and energy models, VDD = 3.0 V, fan-out of 3 inverters; labelled curves: Weak Feedback, Symmetric, Standard; x-axis: propagation delay (ns), 1 to 1.5.]
Figure 4.9: Energy-Delay graph of the C-element implementations in the first test
environment based on the model.
A more detailed comparison is found in Table 4.1. In the first column, the basic transistor
width, W, is listed. The comparisons are made for a range of relatively (i.e., with respect to
the load) small to large widths. Columns 2 to 4 list the average energy, the maximum delay,
minimum delay, and average delay (of six cases) obtained through simulations. Notice that
the range of the delays for a particular W can be quite large. For example, at W = 3 μm for
the weak feedback implementation, the minimum delay is about half the maximum delay.
Also noteworthy is that the "best" over all maximum delays (obtained by the symmetric
implementation) is larger than the "worst" over all minimum delays (obtained by the
weak feedback implementation). The next 3 columns list the energy, delay, and silicon
area calculated using the model. Finally, in the last two columns, we have the percentage
difference between the results of the simulations and the model. It is interesting to note that
the calculated delay value is always within the delay range obtained by simulations, i.e., it is
never larger than MaxD or smaller than MinD.
Table 4.1: Comparing SPICE simulation results with those of the analytical model for the
C-element implementations.
[Table body not recovered. Surviving labels: column "W (μm)"; column group "SPICE Simulations"; row groups "Dynamic Implementation", "Standard Implementation", "Weak Feedback Implementation".]
4.9 Second Test Environment 
The second measurement setup is shown in Figure 4.10. The C-elements in the chain are
all of the same size. This setup differs from the first setup in that the performances of the
C-elements are now mutually dependent, which affects the overall performance of the chain.
A bubble at the input of a C-element schematic means that an inverted version of that input
must be used. This, of course, can be implemented using an inverter. Alternatively, if the
input signal is coming from a gate which produces complementary outputs, the complement
of the output can be directly used. For this test, the latter technique is chosen. The chain
of C-elements shown, without the inverters at the two ends, forms the control circuit
of an n-stage micropipeline. The inverters are added to make the micropipeline self-driven
and oscillating. The signals at the nodes indicated by r(#), where "#" represents the stage
number, can be interpreted as "request for the next stage". Initially, all nodes are set to low,
and the only possible event is the rising of r(0). This logic "one" then propagates through
all the request nodes. Meanwhile, other transitions are produced at r(0) which propagate
toward the end, in turn. If not interfered with, the oscillation of the nodes continues forever. A
much simplified waveform is shown in the same figure. The parameters of interest in this
test are the throughput and the energy per throughput E. The frequency of oscillation of the
micropipeline F, which is half its throughput, is the inverse of T. The energy dissipation
during this period is
where P(VDD) is the average power extracted from VDD.
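The relations between period, frequency, throughput, and energy in this setup can be sketched as below. The frequency and throughput relations are taken from the text (F = 1/T and throughput = 2F); the energy is computed as average power times period, which is an assumption standing in for the equation that is not reproduced here. The function name and the example numbers are hypothetical.

```python
def micropipeline_metrics(period_s, avg_power_w):
    """Return (frequency, throughput, energy per period) for the oscillating chain.

    frequency  F = 1 / T
    throughput   = 2 * F            (two transitions per oscillation period)
    energy     E = P(VDD) * T       (assumed form: average power times period)
    """
    freq = 1.0 / period_s
    throughput = 2.0 * freq
    energy = avg_power_w * period_s
    return freq, throughput, energy


if __name__ == "__main__":
    # Hypothetical operating point: 2.5 ns period, 5.6 mW average power.
    f, tp, e = micropipeline_metrics(2.5e-9, 5.6e-3)
    print(f"F = {f / 1e9:.3f} GHz, throughput = {tp / 1e9:.3f} G events/s, E = {e * 1e12:.1f} pJ")
```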
Figure 4.10: Second measurement setup: testing for optimal sizing of a chain structure in
the presence of feed-back.
Figure 4.11 illustrates the dynamic C-element in the micropipeline test environment. In
order to maximize the frequency of operation of the micropipeline, the delay around the
boldface loop, i.e. the period, must be minimized. The period can be expressed by
and the optimum U obtained is
Thus, the period is reduced to
which is independent of W according to our model. However, as we shall see, simulations
show that the period does depend on W for small W. This can be partly explained by the
approximation made in (3.9). SPICE simulations have confirmed that the optimal value
of U for the micropipeline test circuit is 0.5W, as shown in Figure 4.12. Figure 4.12 (left)
illustrates the simulation results for W = 25 μm, U = 0.4W, as r varies. The maximum
frequency is achieved around r = 1.5. If we want to design a circuit with symmetrical rise
and fall times, we would choose r around 3, but simulations show that this increases the
energy consumption by 60% and reduces the frequency by 10%. Figure 4.12 (right) illustrates
the simulation results for W = 25 μm, r = 1.4, as U varies. The frequency is at its maximum
around U = 0.5W. Hence, the static C-elements are designed with r = 1.5 and U = 0.5W.
Figure 4.11: The dynamic C-element in the second test environment.
[Figure 4.12 plots. Left: x-axis, ratio of width of P-transistors to N-transistors, r (0.4 to 3). Right: x-axis, normalized size of the output inverter transistors, U/W (0 to 1.5).]
Figure 4.12: Optimization of the dynamic C-element in a micropipeline control circuit.
The results of the simulations for the second test environment are depicted in the form
of an energy-frequency (EF) graph in Figure 4.13. Similar to the ED graph, in an EF
graph the outermost curve (here furthest from the origin) is the best alternative. For every
implementation, each data point is taken at a different size. As we go from the left to the
right along the curves, the size of the implementations increases, yielding higher frequencies at
higher energy costs. The curve for the dynamic implementation is also shown in the figure to
measure the deviation of the performance of each implementation from this ideal behaviour.
[Figure 4.13 plot: 8-stage micropipeline control circuit, 0.8 micron BiCMOS technology; labelled curve: Weak Feedback; x-axis: frequency (GHz), 0.15 to 0.5.]
Figure 4.13: Energy-frequency graphs based on SPICE simulations for the single-rail
C-element implementations under the second test.
The almost vertical line at 0.425 GHz, i.e., the highest frequency obtainable with the dynamic
implementation, is the asymptote for the performance of the single-rail implementations.
This is 15% higher than the value predicted by our model, 0.362 GHz. The reason is that,
according to the simulations, node c' has a voltage swing of around 0.8VDD rather than
VDD. Considering the energy-frequency trade-offs, the symmetric implementation seems to
be the circuit of choice among the static implementations for frequencies below 0.4 GHz.
For instance, a frequency of 0.375 GHz can be obtained by the symmetric implementation
at an energy of 14 pJ at W = 9 μm. The same frequency costs 27 pJ, W = 17 μm, with
the standard (or a little more with the resistive) implementation and 54 pJ, W = 35 μm,
with the weak feedback implementation; that is, extra energy by factors of 1.93 and 3.86,
respectively. The dynamic implementation is capable of producing the same frequency for
less than 5 pJ with W = 3.5 μm. We also observe that the curves for the standard, weak
feedback, and resistive implementations are heading towards the asymptote, whereas that
of the symmetric implementation saturates at a lower frequency. The reason is that these
three implementations have exactly the same pull-up pull-down topology as the dynamic
implementation does. The symmetric implementation differs, however, in that it has two
parallel pull-up pull-down structures symmetric with respect to the input signals, and thus,
it is unable to take advantage of the difference in arrival time of the two input signals as the
other implementations would do. This was not predicted by our analytical model.
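The energy factors quoted for 0.375 GHz, and the gap between the simulated and modelled asymptotic frequency, can be recomputed from the stated numbers. The sketch below is only an arithmetic check; the dictionary and variable names are illustrative. It reports the frequency gap relative to both the simulated and the modelled value, since the text does not say which base the 15% figure uses.

```python
# Energy per cycle (pJ) needed to reach 0.375 GHz, as quoted in the text.
energy_pj = {"symmetric": 14.0, "standard": 27.0, "weak feedback": 54.0}

# Extra-energy factor of each implementation relative to the symmetric one.
factors = {name: e / energy_pj["symmetric"] for name, e in energy_pj.items()}

# Asymptotic frequency of the dynamic implementation (GHz).
f_sim, f_model = 0.425, 0.362
gap_vs_sim = 100.0 * (f_sim - f_model) / f_sim      # relative to the simulated value
gap_vs_model = 100.0 * (f_sim - f_model) / f_model  # relative to the modelled value

if __name__ == "__main__":
    for name, factor in factors.items():
        print(f"{name}: x{factor:.2f}")
    print(f"gap relative to simulation: {gap_vs_sim:.1f}%")
    print(f"gap relative to model:      {gap_vs_model:.1f}%")
```

The factors come out as 1.93 and 3.86, matching the text; the 15% figure corresponds to the gap measured relative to the simulated frequency.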
4.10 Concluding Remarks 
A comparative study of the single-rail CMOS implementations of the C-element in terms of
energy, delay, and area has been presented based on a first-order analysis and SPICE
simulations. Four single-rail implementations from the literature have been considered. Special
attention has been given to the energy efficiency of the implementations in order to identify
the proper choice for an energy-efficient high-speed design environment. The techniques we
discussed can be extended to compare different implementations of other primitives. The
energy-delay and energy-frequency graphs illustrated in this chapter are a convenient way of
presenting the trade-offs between energy and delay (or frequency).
The simple first-order model for the delay and energy of CMOS gates presented in Chapter
3 has been applied to the C-element implementations for optimization and comparison.
Good agreement is observed between the model and simulation results. The C-elements
were tested in two different environments using simulations. In the first one, a C-element of
variable size and fan-out is driven by fixed-size inverters. The energy dissipation and delay
of the different implementations were measured through SPICE simulations. In the second
test environment, the circuits were tested for their performance in the presence of feed-back.
For this purpose, a chain of eight C-elements in the form of a micropipeline control circuit
was used. Using SPICE simulations, the energy dissipation per cycle and the frequency of
oscillation of the control circuits were measured and reported. The results obtained through
the first setup are useful when we want to insert a C-element in a spot where the input
drive and output capacitance are known. The second setup, however, can give suggestions
for the implementation of a subsystem in which the C-elements are mutually dependent. In
our study we did not include any layout considerations. We have also ignored wire loads.
Considering that wire loads play a more important role in future technologies, the results
obtained here should be used cautiously.
Both first-order analysis and simulations in the two test environments identify the sym- 
metric implementation as the right choice for energy-efficient, high-speed applications. In 
both tests, the shape of the energy-delay curve for the symmetric implementation is closest 
to that of the dynamic implementation of the C-element, because the symmetric implemen- 
tation has the least overhead for maintaining the state of the output. The keepers in the 
symmetric implementation are only two minimum size transistors, which do not resist output 
switching. In the standard implementation of the C-element, the keepers are six minimum 
size transistors which do not resist output switching either. In the weak feedback implemen- 
tation of C-element, however, the keepers are in the form of an inverter, which does resist 
output switching. Hence, the curve of the weak feedback implementation is farthest from
that of the dynamic C-element. The resistive implementation is a modified version of the
weak feedback implementation with two additional transistors to make it less resistive to
output switching. The performance of the resistive implementation can be improved at the
cost of higher noise sensitivity.
This comparison demonstrates that minimizing the number of transistors dedicated to
latching (keepers) and avoiding topologies that resist output switching may result in
significant energy savings. In a low-power environment, a designer cannot afford to spend
energy on transistors that do not contribute to output switching and are only dedicated to
latching the output, even if they are made of the smallest size. After all, in a low-power design,
most transistors are made small, and the keeper transistors may have a large share in the
total energy consumption. This study also demonstrates how the various implementations
of the keepers change the performance of the very same circuit and affect the energy-delay
trade-off.
Chapter 5 
Differential CMOS Implementations
of the C-Element 
Differential logic circuits, also called double-rail circuits, usually require both the input
signals and their complements. In return, they produce the output and its complement
in a symmetrical manner. Although differential logic circuits usually double the wiring
requirements, they may be beneficial in terms of speed, energy, and area, as reported for
example in [105]. The most well-known differential CMOS logic families are Differential
Cascode Voltage Switch Logic (DCVS) [42] and Complementary Pass-Transistor Logic (CPL)
[105]. The C-element implementations in this chapter do not quite correspond to either logic
family, because they use an inverter latch to hold the state of the outputs. Hence, we refer
to their class as the Differential logic with Inverter Latch (DIL).
This chapter¹ first introduces the basic DIL implementation of the C-element. Then, a few
modified versions of the DIL C-element are presented and their advantages and disadvantages
are explained. After that, the problem of divergence of complementary outputs in a pipeline
is addressed. Finally, we conclude with an overall comparison of the single-rail (including
conventional) and double-rail (differential) implementations in the two typical environments
discussed in the previous chapter. It is important to note that in this chapter evaluations
and optimizations of the circuits are mainly based on HSPICE simulations. The reason is
that the formulation introduced in Chapter 3 does not conveniently cover differential logic
gates. A unified model for delay estimation and optimization of conventional and differential
logic gates will be introduced in the next two chapters. In fact, the exhaustive simulations
required for optimizing the DIL implementations motivated us to develop the delay model
of the next chapters.
¹The content of this chapter has also appeared in [82].
5.1 Basic DIL Implementation of the C-element 
Figure 5.1: DIL implementation of the C-element.
The basic CMOS implementation that uses differential logic and an inverter latch (DIL) is
shown in Figure 5.1. It consists of two pull-down trees of NMOS transistors and an inverter
latch formed by the devices PFL, NFL, PFR, and NFR. If the inputs are both high, then c̄
goes low and c is pulled up through the P-device of the right-hand side inverter of the latch,
PFR. Similarly, if both inputs are low, then c goes low and c̄ rises through the P-device of
the left inverter of the latch, PFL. If the inputs don't match, the inverter latch holds the
previous values of the outputs.
The NMOS transistors of the inverter latch are inactive during switching and, thus,
are of minimum size w to reduce their capacitive loading effect. The PMOS transistors of
the inverter latch, on the other hand, are active during switching and their proper sizing
is critical to delay optimization. For the circuit to function properly, the pull-down trees
must be made less resistive than the latch P-devices. Accordingly, the following must hold:
Wp < 0.5μW, where Wp is the width of the P-devices in the inverter latch and W is the
width of the N-tree transistors, as shown in the figure. Although this gives an upper bound
for Wp, it leaves the designer with a wide range of widths to choose from. However, we know that
if Wp is made too small, the circuit will suffer from high rise-time delay. A large Wp, on the
other hand, intensifies the race problem at the falling output node and results in both large
falling and rising delays.
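The functional constraint on the latch PMOS width can be checked numerically. The sketch below encodes only the bound Wp < 0.5μW from the text. One plausible first-order reading of it is that the two series NMOS devices of width W behave like a single device of width W/2, while a PMOS device is about μ times weaker than an NMOS device of the same width. The value μ = 2.5 and the tree width used in the example are assumptions for illustration, not values stated in this section.

```python
def max_latch_pmos_width(w_tree, mu):
    """Upper bound on the latch PMOS width from the constraint Wp < 0.5 * mu * W."""
    return 0.5 * mu * w_tree


def is_functional(w_p, w_tree, mu):
    """True if the pull-down tree is less resistive than the latch P-device."""
    return w_p < max_latch_pmos_width(w_tree, mu)


MU = 2.5       # assumed N-to-P strength ratio (hypothetical value)
W_TREE = 10.0  # hypothetical N-tree transistor width in micrometres

print("upper bound on Wp:", max_latch_pmos_width(W_TREE, MU))
print("Wp = 0.6 W allowed:", is_functional(0.6 * W_TREE, W_TREE, MU))
```

The bound only guarantees correct operation; the choice of Wp inside the allowed range is a delay and energy trade-off that the simulations address.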
Figure 5.2: Delay and energy of the DIL implementation for various sizes of the PMOS
device in the latch. (Horizontal axis: normalized width of the latch PMOS transistors, 0.1 to
1.4. Curves: rise time and fall time for fan-outs of 3 and 7.)
Figure 5.2 shows the results of HSPICE simulations for the DIL implementation at various
values of Wp/W. The simulations are done for fan-outs of 3 and 7 inverters. The minimum
rising delay for a fan-out of 3 occurs at around Wp/W = 0.6. When the load is increased to
a fan-out of 7, i.e. more than doubled, very little shift is observed for the minimum delay
point along the horizontal axis, to a value of 0.7. In our designs we have chosen Wp/W = 0.6
for the DIL implementation regardless of the load. The analysis of Chapter 7 indicates that
for loads much larger than the input capacitance, minimum delay is achieved at a value of
Wp/W which evaluates to about 0.6. The figure shows that the falling delay increases as Wp/W
increases.
Moving our attention toward the energy graph of Figure 5.2, we realize that for both
fan-outs the energy increases linearly as Wp/W increases up to a certain point, after which the
increment occurs more rapidly. It is interesting that this point of deviation from linearity
coincides with that of the minimum delay. That is, after this point, not only does the delay
increase, but it also costs more in terms of energy and obviously area. The more rapid
change in the energy can be intuitively explained as follows. Starting near the origin, as
Wp/W increases the parasitic capacitances also increase and so does the energy. However, as
we pass the minimum delay point by further increasing Wp/W, the duration of the fighting
increases and we end up spending extra energy for the race on top of the usual amount spent
on the capacitances. The deviation from linearity, shown in the graph, indicates the energy
cost for the race.
Figure 5.3: Modified DIL (MDIL) implementation of the C-element.
Due to the fighting at one of the outputs during switching, the DIL implementation 
has large delay and energy sensitivities with respect to load. This means that as the load 
increases, the DIL implementation will have severe disadvantages compared to the
single-rail implementations in terms of both delay and energy. One way of reducing the delay
and energy sensitivities with respect to load is to insert a couple of inverters between the 
outputs of the DIL implementation and the loads, as illustrated in Figure 5.3. We refer to 
this implementation as the modified DIL (MDIL) implementation. 
Figure 5.4: Delay and energy of the MDIL implementation for various sizes of the output
inverter (fan-out of three inverters). (Horizontal axis: normalized width of the N-tree
transistors, u, 0.1 to 2.1.)
The operation of this circuit is similar to the DIL implementation except that the
worst-case delay is now characterized by the falling delay. With MDIL, the designer has to decide
the width of the output transistors, namely Win and Wip. Let q = Wip/Win and u = W/Win.
Figure 5.4 illustrates the results of HSPICE simulations for the MDIL implementation with
various values of q and u. If q = μ, the output inverters have a threshold of VDD/2. This
would be high for this circuit and results in poor fall-time delay, because the original DIL
implementation has a high rise time. As q is reduced, the threshold of the output inverters
moves toward ground. However, if q is made too low, the circuit will suffer from poor
rise time and lose sharpness in its low-to-high transitions. Larger q also means larger
energy dissipation, because of higher parasitic capacitance. As u increases, the falling delay
decreases, because Wp is also proportionally increased. The rising delay, however, first
decreases and then increases as u increases, with the minimum being around u = 1.
Because of the above trade-offs, the optimum choice of q and u depends on the particular
design requirements. In the next section we see how a design environment may restrict the
value of q while we choose u = 1 to obtain the minimum rising delay.
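The two normalized parameters translate into device widths as follows. This is a minimal sketch based only on the definitions q = Wip/Win and u = W/Win given above. The function name, the example tree width, and the assumption that the latch PMOS keeps the ratio Wp/W = 0.6 chosen earlier for the DIL implementation are illustrative and not taken from the text.

```python
def mdil_widths(w_tree, u, q, wp_ratio=0.6):
    """Device widths of an MDIL C-element from the normalized parameters.

    w_tree   : width W of the N-tree transistors
    u        : W / Win, so Win = W / u
    q        : Wip / Win, so Wip = q * Win
    wp_ratio : assumed Wp / W for the latch PMOS (0.6 is the DIL choice)
    """
    w_in = w_tree / u          # NMOS of the output inverter
    w_ip = q * w_in            # PMOS of the output inverter
    w_p = wp_ratio * w_tree    # PMOS of the inverter latch, scales with W
    return {"Win": w_in, "Wip": w_ip, "Wp": w_p}


# Hypothetical tree width of 4 um with u = 1 and q = 1.
print(mdil_widths(4.0, u=1.0, q=1.0))
```

Because Wp is tied to W in this sketch, lowering Win at fixed u also lowers Wp, which matches the remark that Wp increases proportionally with u.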
5.2 Divergence of the Complementary Outputs 
In differential circuits there is a time difference between the production of complementary
outputs. In synchronous circuits this doesn't pose much of a problem, because of the presence
of a clock signal and the absence of feed-backs. In asynchronous circuits, however, this time
difference can be troublesome in two cases. One is when the faster output of a gate is fed
back to the same gate and arrives at the gate before the other, slower output is produced.
Another case is when a series of differential gates form a long path and the divergence of the
complementary outputs increases along the path until they become so far apart that at
some stage one of the gates fails to respond to its inputs.
If the differential gates in a circuit are active high, i.e. activated only when some of their
inputs become high, and their falling outputs are produced first, then the gates function
properly no matter how long the production of the rising outputs takes. On the contrary, if
the gates are active high and the rising outputs are produced first, then the divergence of the
complementary outputs could affect the operation of the gates. The DIL implementation,





Figure 5.5: Divergence of the complementary outputs in a micropipeline control circuit:
outputs of different stages at the first cycle (top), outputs of the same stage at different
cycles (bottom). (Horizontal axis of the bottom panel: cycle number, 1 to 10.)
Figure 5.5(top) shows the time difference between the complementary outputs of the
C-elements at various stages of the micropipeline control circuit. The C-elements are
implemented in DIL and MDIL with q ranging from 1 to 1.25. Obviously, as q increases the
complementary outputs get further apart. Nevertheless, in all cases the time difference
between the outputs remains almost constant along the pipeline except for the two ends. At
the two ends the loads of the C-elements are higher and thus, in the case of DIL the outputs
become further apart while in the case of MDIL the outputs get closer. Figure 5.5(bottom)
shows the time difference between the complementary outputs of a C-element in the
middle of the pipeline at different cycles. For the DIL implementation we can see that after
4 cycles the time difference between the outputs converges to around 0.57 ns. For the MDIL
implementation with q = 1.0, after some fluctuations the time difference converges after the
seventh cycle. If q is increased to 1.10, the fluctuations increase, but the time difference finally
seems to converge. A further increase in q to 1.20 results in a large jump in the time difference
between the second and third cycles, and the pipeline fails to oscillate after the third cycle.
Of course, the problem becomes worse if q is made any larger. This test illustrates that if
the MDIL implementation is to be used in such a micropipeline control circuit, q must be
made less than or equal to unity. If the C-elements are driving some latches (load), q can be
made larger, because loading decreases the time difference between the outputs of the MDIL
implementation. Successive simulations show that for a micropipeline control circuit without
any load, the frequency is maximized if u = 0.7 and q = 1.2 (we were able to set q = 1.2 by
reducing u from 1 to 0.7). These are the values used for the MDIL implementation in the
comparative study made for the second test environment. For the first test environment,
however, q and u are both set to unity.
5.3 DILP and DILN Implementations 
Two possible modifications of the DIL implementation are shown in Figure 5.6. They are
both intended to reduce the rising delay, which is the main drawback of the DIL
implementation. In Figure 5.6(left), two pull-up trees of PMOS transistors have been added to
the original DIL implementation. We refer to this implementation as the DILP C-element.
When both inputs are low or both are high, one of the outputs falls and the other one rises
almost simultaneously. Thus, the outputs switch independently, in contrast to the original
DIL implementation. In this case, the inverter latch is merely used for maintaining the states
of the outputs and, hence, has minimum-size P- and N-devices. Later, we will see that the
minimum size latch could be troublesome in the presence of feed-back. This implementation
has less delay and energy than the DIL implementation, and less delay but higher energy
than the MDIL implementation. The reason for the relatively high energy consumption is
its high input capacitance. The input capacitance can be reduced by replacing the P-tree
with an N-tree as shown in Figure 5.6(right). We refer to this implementation as the DILN
C-element. In the DILN implementation, during switching, one of the outputs falls from VDD
to ground, while the other output simultaneously rises from ground to VDD - VTN. Then, it
is boosted to VDD with the help of the inverter latch. Thus, the rise and fall times are not
symmetric, unlike in the DILP implementation. However, the amount of symmetry
is greatly enhanced over the DIL implementation.
Figure 5.6: Schematics of the DILP and DILN C-element implementations.
In a micropipeline, because of the presence of feed-back, the divergence of the
complementary outputs in the DILP implementation could be troublesome. Even if all of the
complementary inputs of a DILP implementation arrive simultaneously, there is no
guarantee that the complementary outputs would be simultaneously produced. The reasons are
that, firstly, the P-tree and N-tree exhibit different resistances depending on the load and
the state of the output voltages. Secondly, if the minimum size latch is used, then during
switching the P-tree has to fight a resistance proportional to w, whereas the N-tree has to
fight a resistance proportional to w/μ. The second factor could be overcome by using a
latch in which the P-devices are μ times larger than the N-devices. The first factor could be
compensated for by increasing the overall size of the latch as the loading increases.
Figure 5.7: HSPICE results for the DILP C-element implementation. (Horizontal axis:
frequency (GHz), 0.1 to 0.36. Curves are labelled by the inverter latch Wn/Wp, and an arrow
marks the direction of increasing size.)
Figure 5.7 shows the results of HSPICE simulations for the DILP implementation. The
graph shows the frequency of operation of the micropipeline control circuit as the size of the
C-element is increased. With inverter latches of Wn/Wp = w/w = 1.4/1.4, a frequency of
0.22 GHz is obtained at a gate area of about 25 μm². Further increase in the sizes of the
C-elements results in failure. If we use symmetric latches with inverters of Wn/Wp = 1.4/3.5,
we can achieve a wider range of frequencies up to 0.3 GHz, after which increasing the sizes
of the C-elements leads to failure. To get higher frequencies we have to use larger symmetric
latches. For example, by using latches with Wn/Wp = 3/7 a frequency of 0.32 GHz is also
achievable. The drawback is that each time we use a larger latch, we get to a higher level
of energy consumption, as shown in the graph, because we are increasing the amount of
parasitic capacitance.
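The latch sizes quoted above can be compared by their P-to-N width ratio. The sketch below treats a latch as "symmetric" when Wp/Wn is close to μ, following the earlier remark that the P-devices should be μ times larger than the N-devices. The value μ = 2.5 (suggested by the 1.4/3.5 latch) and the 10% tolerance are assumptions made for this illustration only.

```python
# Latch sizes quoted in the text, as (Wn, Wp) in micrometres.
latches = {
    "1.4/1.4": (1.4, 1.4),
    "1.4/3.5": (1.4, 3.5),
    "3/7": (3.0, 7.0),
}

MU = 2.5  # assumed P-to-N width ratio of a symmetric latch (hypothetical)


def pn_ratio(w_n, w_p):
    """Ratio of the PMOS width to the NMOS width of a latch inverter."""
    return w_p / w_n


def is_symmetric(w_n, w_p, mu=MU, tol=0.1):
    """True if Wp/Wn is within a relative tolerance `tol` of mu."""
    return abs(pn_ratio(w_n, w_p) / mu - 1.0) <= tol


for label, (w_n, w_p) in latches.items():
    print(f"{label}: Wp/Wn = {pn_ratio(w_n, w_p):.2f}, symmetric: {is_symmetric(w_n, w_p)}")
```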
The DILN implementation is much more robust than the DILP implementation in a
micropipeline environment and rarely fails, because, like the basic DIL implementation,
it is active-high and its rise-time delay is larger than its fall-time delay. The frequency
versus gate area and energy versus frequency graphs for the DILN implementation are shown
in Figure 5.8. With 1.4/1.4 latches, the complementary outputs are far apart and the
performance is very poor. Using 1.4/3.5 latches helps to increase the frequency up to 0.22
GHz at an area of around 30 μm². Further increase of the area is counter-productive and
reduces the frequency, because the feed-back signals become so fast that they slow down the
full generation of the forward signals. Increasing the size of the latch is of some help, since
it increases the race and slows down the feed-back signals. The energy increases a lot by
using larger latches without much gain in frequency. In order to avoid counter-productive
interaction of signals in a micropipeline control circuit using the DILN implementations
one can add a couple of inverters to the outputs of each DILN implementation. We call
this implementation the MDILN implementation. The MDILN implementation for which
the simulation results are reported in the next section is designed with a 1.4/3.5 latch and
output inverters with N- and P-transistors 2.5 times larger than the N-tree transistors (i.e.
u = 0.4 and q = 1).
Figure 5.8: HSPICE results for the DILN C-element implementation. (Horizontal axis:
frequency (GHz). Curves are labelled by the inverter latch Wn/Wp.)
5.4 Results and Discussion 
This section presents the results of comparing the optimized single-rail and differential im- 
plementations of the C-element for the two typical environments introduced in the previous 
chapter. 
Figure 5.9 depicts the energy-delay graphs under the first test (with fan-out of three
inverters) based on simulation results for all of the C-element implementations considered.
The best delay obtained by a conventional implementation is 1.15 ns and belongs to the
symmetric C-element. The same delay can be obtained using the DILP or DILN
implementations with 20% less energy consumption. Delays below 1.15 ns, down to 1.0 ns, could only be
obtained with the differential implementations. For a lower number of fan-outs, the results
are even more in favour of the DILP and DILN implementations. The DILP and DILN
C-elements for this environment are implemented with minimum size latches.
Figure 5.9: HSPICE results for the single-rail and double-rail C-element implementations
under the first test. (Horizontal axis: propagation delay (ns), 0.95 to 1.5.)






Figure 5.10: Simulation results for the single-rail and double-rail CMOS implementations of
the C-element in a micropipeline environment. (Plot annotations: 8-stage micropipeline
control circuit, 0.8 micron BiCMOS technology, asymptote. Horizontal axis ticks run from
0.1 to 0.5.)
The overall results of the simulations for the single-rail and double-rail implementations
of the C-element under the second test are depicted in Figure 5.10. It shows the
energy-frequency curves for the four single-rail implementations and for the MDIL and MDILN
implementations. The curves for the other double-rail implementations are omitted, because
of poor performance. (The curves for the DILP and the DILN implementations are depicted
separately in Figure 5.7 and Figure 5.8 respectively.) The performance of each
implementation has been optimized according to the preceding discussions. For every implementation,
each data point is taken at a different size. As we go from the left to the right along the
curves, the size of the implementations increases, yielding higher frequencies at higher energy
costs. Of course, in an energy-frequency graph like this one, the outer curves indicate better
energy-frequency trade-offs. The curve for the dynamic implementation is also shown in
the figure as a representative of an almost ideal case, and to measure the deviation of the
performance of each implementation from this more ideal behaviour. The dashed vertical
line at 0.45 GHz, i.e. the highest frequency obtainable with the dynamic implementation,
is the asymptote for the performance of the single-rail implementations. The double-rail
implementations seem not to be able to cross that line either. The MDILN
implementation comes very close, but fails afterwards, because of the divergence of the complementary
outputs. In terms of energy and frequency, the symmetric C-element outperforms the other
single-rail implementations. However, frequencies over 400 MHz are only obtainable with the
double-rail implementations. The MDILN implementation offers the best energy-frequency
trade-off, followed by the MDIL implementation.
5.5 Concluding Remarks 
We have presented a comparison of the optimized CMOS implementations of the C-element
in terms of performance and energy. Besides the dynamic implementation of the C-element,
we have addressed eight static implementations, four of which are single rail and the rest are
double rail. The single-rail implementations are the standard, weak feedback, resistive, and
symmetric C-elements introduced in the previous chapter. The double-rail implementations
are the DIL implementation of Figure 5.1, the MDIL implementation of Figure 5.3, and the
DILP and DILN implementations of Figure 5.6. We also briefly mentioned the MDILN
implementation, which is basically the DILN implementation with a couple of output inverters.
Ways to optimize the performance of the various implementations have been discussed. One
should be very cautious when using the differential logic implementations in pipelines
including feedback loops, as we have observed some divergence phenomena with these circuits
leading to failure. Although we have explained methods for reducing the problem, further
study is needed to fully explain this behaviour.
Chapter 6 
Delay Modeling at the Device and 
Switch Levels 
This chapter presents a new, unified, and simple model for estimating the charging and
discharging delays of an MOS transistor, whether of NMOS or PMOS type. In other words,
this model covers the delay of an MOS transistor for transferring both a logic 1 and a
logic 0. Armed with such a model, it is possible to predict the delay of a CMOS logic
gate implemented in any style. In conventional CMOS logic gates, NMOS transistors are
solely used to discharge the output capacitance, i.e. to transfer a logic 0 to the output.
Correspondingly, PMOS transistors are dedicated to charging the output capacitance, i.e. to
transfer a logic 1 to the output. In general, both types of transistors may be used to carry
a logic 1 or a logic 0, such as in a CMOS transmission gate. This idea has led to a number
of innovative CMOS logic styles, such as complementary pass-transistor logic (CPL) [105]
and the differential cascode voltage switch logic (DCVSL) [42]. NMOS transistors are also
widely used in latches and flip-flops for passing both 1's and 0's. For developing a delay
model which covers the various combinations of the transistors in a circuit, we basically
need to predict accurately the current through NMOS and PMOS transistors during both
charging and discharging scenarios.
First, we introduce a new and simple model for the saturation current in MOS transistors
and demonstrate that it is more accurate than the model proposed in the literature with the
same level of complexity. Then, we argue that the same model can be applied to predict the
current in an NMOS transistor when charging a node and, similarly, to a PMOS transistor
when discharging a node. We are not aware of any such model in the literature, except for
the very rough techniques discussed in textbooks like [72]. Next, we discuss the effect of
finite input signal slope. This is followed by the extension of a ramp delay model to circuit
scenarios involving overlapping and opposing currents. After that, we further extend the
delay model to cover a chain of series-connected transistors. Finally, we present a procedure
for extracting the empirical parameters of our model with HSPICE simulation.
Note that starting from this chapter, we use a 0.5 μm CMOS technology for modeling
and HSPICE simulations, instead of the 0.8 μm BiCMOS technology used in the previous
chapters.
Since we deal with two types of transistors and two types of delays for each transistor 
type, we need to establish a notation. We use the subscripts "n" and "p" to indicate 
correspondence of a parameter to NMOS or PMOS devices, respectively. We also use the 
accents * ' and ' to indicate correspondence of a parameter to rising or falling delay,
respectively.
6.1 On Short-Channel Effects 
Shockley's square law over-estimates the saturation current in short-channel devices. The
major reasons for this are the velocity saturation and the mobility degradation effects, which
are not accounted for in the square law. While the former effect is attributed to the
component of the electric field along the channel, the latter effect is attributed to the component
of the electric field perpendicular to the channel.
The velocity of the carriers v is proportional to the applied electric field E as long as E
is less than Esat ≈ 10^5 V/cm. The constant of proportionality is the carrier mobility. For
larger values of E, the carrier velocity tends to saturate at around 10^7 cm/s. This effect
severely limits the saturation current in MOS devices.
On the other hand, since the oxide thickness is also scaled down besides the channel
length, the effective carrier mobility in short-channel devices degrades due to the vertical
component of the electric field. It is known that the effective surface mobility depends on
the vertical electric field according to the following approximation [62]
    \mu_s = \frac{\mu_0}{1 + \theta (V_{GS} - V_T)}    (6.1)
where \mu_0 is the low-field surface mobility and \theta is an empirical factor with a typical value of
around 0.2. Then, taking into account the effects of the lateral field and velocity saturation,
the effective mobility, as used in SPICE Level 3, can be approximated by [62]
    \mu_{eff} = \frac{\mu_s}{1 + \mu_s V_{DS} / (v_{max} L)}    (6.2)
where v_{max} is the maximum drift velocity of the carriers and, in saturation, V_{DS} is replaced
by its saturation value.
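The two mobility corrections can be evaluated numerically. The following is a minimal sketch, assuming the usual SPICE Level 3 forms μs = μ0 / (1 + θ(VGS - VT)) and μeff = μs / (1 + μs·VDS / (vmax·L)). All numerical values in the example (low-field mobility, threshold voltage, saturation drain voltage) are hypothetical and chosen only to show the order of magnitude of the degradation.

```python
def surface_mobility(mu0, theta, vgs, vt):
    """Surface mobility degraded by the vertical field (assumed Level 3 form)."""
    return mu0 / (1.0 + theta * (vgs - vt))


def effective_mobility(mu_s, vds, v_max, length):
    """Effective mobility including the lateral field and velocity saturation.

    In saturation, `vds` should be the saturation value of the drain voltage.
    """
    return mu_s / (1.0 + mu_s * vds / (v_max * length))


# Hypothetical values: mobility in cm^2/(V s), voltages in V,
# velocity in cm/s, channel length in cm (0.5 um = 0.5e-4 cm).
MU0, THETA = 500.0, 0.2
VGS, VT, VDSAT = 3.0, 0.7, 1.0
V_MAX, LENGTH = 1.0e7, 0.5e-4

mu_s = surface_mobility(MU0, THETA, VGS, VT)
mu_eff = effective_mobility(mu_s, VDSAT, V_MAX, LENGTH)
print(f"mu_s = {mu_s:.1f}, mu_eff = {mu_eff:.1f} cm^2/(V s)")
```

Both corrections reduce the mobility, so the square law, which uses the low-field value, over-estimates the saturation current.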
6.2 MOSFET Delay and Current
The delay D of charging or discharging a capacitance C to VDD/2 through a transistor of
width W is governed by the following general delay expression.
We prefer to combine all supply voltage and technology dependent parameters in v and write
the delay expression as above. This form also clearly shows the relation between the delay
and the width of the transistor.
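A small numerical sketch may help to see how such a delay expression behaves. It assumes the usual charge-balance form D = C·(VDD/2)/I for a constant average current I, and, when I is proportional to the transistor width, the equivalent form D = ν·C/W with ν = VDD/(2·I/W). The name `nu` stands in for the combined parameter the text calls v. The exact form of the thesis expression is not reproduced here, and the load, current density, and width below are hypothetical.

```python
def delay_from_current(c_load, vdd, i_avg):
    """Time to move a capacitance by VDD/2 with a constant average current."""
    return c_load * (vdd / 2.0) / i_avg


def nu_from_current_density(vdd, i_per_width):
    """Combined supply and technology parameter, assuming I = i_per_width * W."""
    return vdd / (2.0 * i_per_width)


def delay_from_width(c_load, width, nu):
    """Delay written in the form D = nu * C / W."""
    return nu * c_load / width


C_LOAD = 50e-15    # hypothetical load capacitance in farads
VDD = 3.0          # supply voltage in volts
I_PER_UM = 0.3e-3  # hypothetical saturation current per micrometre of width (A/um)
WIDTH_UM = 5.0     # hypothetical transistor width in micrometres

d_current = delay_from_current(C_LOAD, VDD, I_PER_UM * WIDTH_UM)
d_width = delay_from_width(C_LOAD, WIDTH_UM, nu_from_current_density(VDD, I_PER_UM))
print(f"D from current: {d_current:.3e} s, D from width: {d_width:.3e} s")
```

In this form, doubling the width halves the delay for a fixed load, which is the relation between delay and width that the text points to.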
Before proceeding any further with the details of MOSFET current models, let's see
which mode of operation of the MOSFET is of interest to us. Figure 6.1 illustrates the
current-voltage behaviour of a 0.5 μm NMOS transistor. The horizontal arrow in the figure
shows the trace of the value of the current through the transistor when transferring a logic
0 half-way through. That is, from the time the output is at VDD = 3 V until it is discharged
to half VDD. At the beginning of a logic 0 transfer, VDS and VGS both equal VDD. At the
end of the transfer, VDS equals 0.5VDD, while VGS remains equal to VDD. In short-channel
devices, thanks to their relatively extended saturation region, this process takes place almost
entirely in the saturation mode and, hence, the value of the current remains constant, as
shown by the arrow. The oblique arrow in the figure shows the trace of the value of the
current when transferring a logic 1 half-way through. That is, from the time the output is
at zero volts until it is charged to half VDD. The initial condition at the beginning of a logic
1 transfer is exactly similar to that of a logic 0 transfer. At the end of a logic 1 transfer,
however, VDS and VGS both reduce to half VDD. In this case, although the value of the
current decreases, the device still remains in the saturation mode, as shown by the arrow.
Note that the presentation in this figure is only superficially true, because it doesn't show
the body effect.
Figure 6.1: IV-characteristic curves for a 0.5 μm NMOS transistor showing the change in
current during transferring a logic 1 and a logic 0.
A PMOS transistor has a similar behaviour. Therefore, whether it is a rising delay or a
falling delay that we want to calculate, the current in the general delay expression (6.3) is
some average saturation drain current ID. Hence, we may limit our search for a simple and
accurate MOSFET model to the saturation region.
6.3 A New Model for MOSFET Saturation Current
Shockley's model stipulates that the saturation current is proportional to (VGS − VT)^2, which
is why it is also known as the square law. The popular α-power law proposed in [76-78] argues
that, due to velocity saturation effects, the saturation current in modern devices follows
$$I_D = P_c W (V_{GS} - V_T)^{\alpha} \qquad (6.4)$$
where Pc is a technology-related parameter and α is a number between 1 and 2 called the
velocity saturation index. For long-channel devices α = 2 and the model reduces to
Shockley's square law. For short-channel devices under the velocity saturation effect, α is closer
to 1. Similar models in very recent publications claim that α equals 1.25 [14] or 1.3 [13] for
deep sub-micron devices. Studies presented in [76] for devices as narrow as 0.5 µm show
that the α-power approximation is generally good.
Figure 6.2 depicts the relation between the drain current and the gate voltage in the
saturation mode for an NMOS transistor. The solid line in the figure is obtained by HSPICE
Level 13 simulation. The dashed lines with and without dots are obtained using the α-power
law. Since the α-power law has two parameters, namely Pc and α, their values can be
obtained by fitting the curve to the simulation data at two different points. The dashed line
is the result of this fitting at VGS = 3 V and VGS = 4 V, which leads to α = 1.03. With this
value of α the model loses its accuracy as VGS decreases and approaches VTN. It is clear that
at VGS = 1 V the error is over 50% compared to simulation. If we try to fit the model at
lower values of VGS, for example at VGS = 1 V and VGS = 2 V, we end up with the dashed
line with dots. We have to increase α from 1.03 to 1.46. Now the model becomes inaccurate
towards higher values of VGS and, for instance, over-estimates the value of the current by as much
as 33% at VGS = 4 V. Hence, α clearly needs to be decreased as VGS increases.
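The two-point fitting described above can be sketched in a few lines. This is only an illustration: the HSPICE data behind Figure 6.2 is not available here, so the hypothetical function `reference_current` stands in for it, using the NMOS falling values later listed in Table 6.1. The fitted indices therefore differ from the 1.03 and 1.46 quoted in the text; only the trend is expected to carry over.

```python
import math

VT = 0.66  # V, NMOS threshold quoted in the Figure 6.2 caption


def reference_current(vgs, w=10.0):
    """Stand-in for simulation data (ASSUMPTION): Table 6.1 NMOS falling fit, in mA."""
    return 0.12065 * w * (vgs - VT) ** (1.1265 + 0.4061 / vgs)


def fit_alpha_power(v_a, v_b, w=10.0):
    """Fit I_D = Pc * W * (VGS - VT)**alpha exactly through two points."""
    i_a, i_b = reference_current(v_a, w), reference_current(v_b, w)
    alpha = math.log(i_b / i_a) / math.log((v_b - VT) / (v_a - VT))
    pc = i_a / (w * (v_a - VT) ** alpha)
    return pc, alpha


pc_hi, alpha_hi = fit_alpha_power(3.0, 4.0)  # fit at the upper end of the range
pc_lo, alpha_lo = fit_alpha_power(1.0, 2.0)  # fit at the lower end of the range
print(f"fit at 3 V and 4 V: alpha = {alpha_hi:.2f}")
print(f"fit at 1 V and 2 V: alpha = {alpha_lo:.2f}")
```

With this stand-in data the index fitted at low gate voltages comes out larger than the one fitted at high gate voltages, which is the behaviour that motivates a voltage-dependent index.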
As VGS increases, it has an incremental and, at the same time, a decremental effect on
ID. Increasing VGS, on the one hand, increases the number of mobile charges in the
channel, which boosts the current; on the other hand, it reduces the effective mobility
of the carriers, which limits the current. The combination of these two physical phenomena
does not seem to be accurately reproduced by the α-power law. The α-power law only models
the incremental effect of VGS on ID.
Figure 6.2: Simulated and calculated values of ID as a function of VGS using the α-power
law and the new model. The threshold voltage of the device is VTN = 0.66 V.
[Horizontal axis: VGS, gate-to-source voltage (V).]
As an extension of Shockley's square law and Sakurai's α-power law, we propose the
following model for the saturation drain current of a MOSFET.
$$I_D = \kappa W (V_{GS} - V_T)^{\xi + \theta / V_{GS}} \qquad (6.5)$$
where α has been replaced by the VGS-dependent index ξ + θ/VGS. Now the index combines
both short-channel effects, velocity saturation and mobility degradation. The velocity
saturation effect is captured by ξ and the mobility degradation effect is captured by θ. The
parameter κ depends on the technology and the effective channel length. We have hidden the
channel length inside κ, because digital designers rarely use transistors with channel lengths
larger than the minimum feature length specified by the technology. This model has about
the same complexity as the previous models using a constant index, but it is more accurate.
Since this model has an additional parameter, we need three points for curve fitting. The
curve represented by small circles in Figure 6.2 is obtained by fitting the model at VGS equal
to 1, 2, and 3.5 volts. The model nicely reproduces the saturation region characteristics of
an MOS device.
Figure 6.3: Step delay of an NMOS transistor discharging an output capacitance as VDD
changes. HSPICE simulation results are compared with the results obtained by the α-power
law and the results obtained by the new model, represented by small circles.
[Legend: α-power law, D = (C VDD/2) / [0.124 W (VDD − VT)^α] and
D = (C VDD/2) / [0.100 W (VDD − VT)^1.36]; new model,
D = (C VDD/2) / [0.121 W (VDD − VT)^(1.12 + 0.40/VDD)]. Horizontal axis: VDD, power supply voltage (V).]
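As a minimal sketch, the saturation-current model of this section can be evaluated directly. It is read here as I_D = κ W (V_GS − V_T)^(ξ + θ/V_GS), a form inferred from the figure legends and Table 6.1; the function names are hypothetical, and the parameter values are the NMOS falling entries of Table 6.1 with V_T = 0.66 V from the Figure 6.2 caption.

```python
def saturation_current(vgs, w, kappa, xi, theta, vt):
    """I_D = kappa * W * (VGS - VT)**(xi + theta/VGS); mA when kappa is in mA/um."""
    return kappa * w * (vgs - vt) ** (xi + theta / vgs)


def effective_index(vgs, xi, theta):
    """Voltage-dependent exponent that replaces the constant alpha."""
    return xi + theta / vgs


# NMOS falling parameters as read from Table 6.1 (W in um, voltages in V).
NMOS_FALLING = dict(kappa=0.12065, xi=1.1265, theta=0.4061, vt=0.66)

for vgs in (1.0, 2.0, 3.0, 4.0):
    idx = effective_index(vgs, NMOS_FALLING["xi"], NMOS_FALLING["theta"])
    i_d = saturation_current(vgs, w=10.0, **NMOS_FALLING)
    print(f"VGS = {vgs:.1f} V  index = {idx:.3f}  ID = {i_d:.3f} mA")
```

Setting `theta` to zero recovers a constant-index law, so the α-power model is a special case of this sketch.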
Figure 6.3 shows a comparison between the HSPICE simulation, the α-power model, and
the new model in predicting the delay of discharging an output capacitance by an NMOS
transistor at different power supply voltages. Although we had the values of the parameters
α, Pc, ξ, θ, and κ from the previous experiment, we repeated the curve-fitting process here
to obtain better matches. The minor differences in the values of the extracted parameters
compared to the previous experiment are due to the overshoot that appears in the output
voltage signal through the coupling capacitances. The simple delay equation does not include
this effect. The α-power law curve is produced for two sets of voltages, different from the previous
experiment. The dashed line represents the delay obtained by the α-power law fitted at the
points of VDD equal to 2 V and 3 V, while the dashed line with dots represents the delay
obtained by the same model fitted at the middle points of VDD equal to 1 V and 4 V. For our
model, the fitting points are the same as in the previous experiment, i.e. VDD equal to 1 V,
2 V, and 3.5 V. Note that the values of α obtained here are very close to those suggested
in [14] and [13]. This experiment confirms the superiority of the new model over the
α-power law in predicting the delay over a wide range of supply voltages.
6.4 Generalization of the Model
In reference to Figure 6.1 we mentioned that even if an NMOS transistor is used to transfer
a logic 1 instead of a logic 0, our concern would still be the saturation current. This suggests
that we might be able to use the same model as well. What follows explains our attempt at
extending the model to predict the delay of an NMOS transistor when charging a capacitance.
Figure 6.4 illustrates the simulation and our modeling results for this case over a variety
of supply voltages. The solid line is, as usual, obtained by HSPICE simulations. At the
initial state of the charging process, VGS = VDS = VDD. While the capacitance is being
charged, both VGS and VDS decrease, but remain equal. At the end of the process, the
output reaches VDD/2 = VGS = VDS. Note that the minimum VDD applied here is 1.5 V
rather than 1 V, because VDD has to be at least 2VTN for the output to reach VDD/2. As our
first attempt, we use the same equation as for the falling delay, but replace VDD with 3/4 VDD,
i.e. the value of VGS at the middle of the charging process. The result of this technique
is shown with the dashed line. It loses its accuracy very fast as it approaches the lower
end of the VDD scale. One important factor that we have ignored is the body effect, which
dynamically increases VTN during the charging process. Hence, in the next attempt, the
result of which is shown by the dashed line with dots, we replace VTN by V'TN to include the
body effect. The following, generally known [72], expression is used.
$$V'_{TN} = V_{TN} + \gamma \left( \sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|} \right)$$
where γ is the body effect coefficient, φF is the Fermi potential, and VSB is the source-to-bulk
potential. Midway in the process, VSB = 1/4 VDD. This technique almost solves the
problem for higher values of VDD, but the current behaviour seems to be so non-linear that
the technique obviously fails at lower values of VDD. Observing that the shape of the delay
curve resembles that of the discharging case, the final choice is to use direct fitting of the
model to the simulation data. Thus, we define a separate set of parameters κ, ξ and θ for
the case of the rising delay. This method is highly successful, as depicted in the figure by
the small circles. Notice that, compared to the case of the falling delay, ξ is only slightly
larger, whereas θ is larger by an order of magnitude. This shows that the NMOS rising delay
is much more sensitive to VDD changes than its falling delay.
Figure 6.4: Step delay of an NMOS transistor charging an output capacitance as VDD
changes. HSPICE simulation results are compared with the results obtained by the new
model.
[Legend: solid line, HSPICE simulation Level 13; dashed line,
ID = 0.121 W (3/4 VDD − VTN)^(1.12 + (4/3) 0.40/VDD); dashed line with dots,
ID = 0.121 W (3/4 VDD − V'TN)^(1.12 + (4/3) 0.40/VDD); small circles,
ID = 0.014 W (VDD − VTN)^(1.32 + 4.70/VDD). D = (C VDD/2)/ID, VSB = 1/4 VDD,
C = 1000 fF, W = 10 µm, γ = 0.60.]
Figure 6.5 shows that the model is equally valid for PMOS transistors in the charging
and discharging cases. Note that ξ for the PMOS rising delay is larger than ξ for the NMOS
falling delay; this indicates that the velocity saturation effect is less severe in the PMOS
transistors of this technology. We should mention that while ξ and θ of the NMOS falling
delay and PMOS rising delay are directly related to the corresponding saturation currents
and are associated with some physical phenomena, these parameters for NMOS rising delay
and PMOS falling delay are not directly related to the same phenomena.
Figure 6.5: Charging and discharging delays of a PMOS transistor as obtained by simulations
and using the model.
[Legend: PMOS rising delay, model index 1.38 + 0.38/VDD; PMOS falling delay,
ID = 0.003 W (VDD − VTP)^(1.51 + 6.86/VDD); solid lines, HSPICE simulation.
Horizontal axes: VDD, power supply voltage (V).]
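The three attempts at modeling the NMOS rising delay can be compared numerically. The sketch below assumes the legend values of Figure 6.4 (κ = 0.121, index 1.12 + 0.40/VGS, directly fitted set 0.014 and 1.32 + 4.70/VDD, γ = 0.60, C = 1000 fF, W = 10 µm) and V_TN = 0.66 V. The value `PHI2` for |2φ_F| is not given in the text and is purely an assumption, as are the function names.

```python
import math

VTN = 0.66    # V, threshold from the Figure 6.2 caption
GAMMA = 0.60  # V**0.5, body-effect coefficient from the Figure 6.4 legend
PHI2 = 0.6    # V, ASSUMED magnitude of 2*phi_F (not stated in the text)


def vtn_with_body_effect(vsb):
    """Threshold raised by the source-to-bulk voltage VSB."""
    return VTN + GAMMA * (math.sqrt(PHI2 + vsb) - math.sqrt(PHI2))


def rising_delay_midpoint(c, w, vdd, body_effect=False):
    """Attempts 1 and 2: falling-delay model evaluated at VGS = 3/4 VDD."""
    vgs = 0.75 * vdd
    vt = vtn_with_body_effect(0.25 * vdd) if body_effect else VTN
    i_d = 0.121 * w * (vgs - vt) ** (1.12 + 0.40 / vgs)
    return c * vdd / 2.0 / i_d


def rising_delay_fitted(c, w, vdd):
    """Attempt 3: separate parameter set fitted directly to the rising delay."""
    i_d = 0.014 * w * (vdd - VTN) ** (1.32 + 4.70 / vdd)
    return c * vdd / 2.0 / i_d


# C in fF and currents in mA give delays in ps.
for vdd in (2.0, 3.0, 4.0):
    d1 = rising_delay_midpoint(1000.0, 10.0, vdd)
    d2 = rising_delay_midpoint(1000.0, 10.0, vdd, body_effect=True)
    d3 = rising_delay_fitted(1000.0, 10.0, vdd)
    print(f"VDD = {vdd:.1f} V  plain = {d1:.0f}  body = {d2:.0f}  fitted = {d3:.0f} ps")
```

Including the body effect raises the threshold and therefore lengthens the predicted delay; how closely either attempt tracks simulation depends on the assumed φ_F and cannot be judged from this sketch alone.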
6.5 Effect of Input Waveform Slope on Delay 
The simplest way of accounting for the effect of the input signal slope on the delay is to use
an approximation of the form [40]
$$D_T = D + S_T T \qquad (6.6)$$
This technique adds a fraction ST of the input rise or fall time T to the step delay D to obtain
the ramp delay DT. It has been mentioned that this approximation is valid if the input slope
exceeds one-third of the output slope [40]. On the ground that this rule is usually true in
VLSI circuits, one may use the above formula. Otherwise, the input slope effect is much
more complicated, as analyzed for example in [7,31,49,18]. In fact, if the input changes too
slowly, the delay may be negative. When applying the above formula to a CMOS inverter,
an additional source of error is the short-circuit current, which increases as the input slope
decreases. Using his α-power law, Sakurai has derived the following expression for ST [76].
$$S_T = \frac{1}{2} - \frac{1 - V_T / V_{DD}}{1 + \alpha} \qquad (6.7)$$
This is similar to the expression derived in [40] for long-channel devices, where α = 2. We
use this expression in our work. However, instead of having one α value for NMOS devices
and one for PMOS devices, we introduce four values for α corresponding to the four types
of delays. Figure 6.6 illustrates the relation between the four delay types and the input
signal slope. The figure compares the simulation results with the above expression for the
four cases at two different VDD settings. The value of α for each case is stated in the figure's
caption. This experiment demonstrates that, with the right values of α, the same delay
model as (6.7) can be used for charging and discharging cases through NMOS and PMOS
transistors.
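A small sketch of the ramp-delay calculation follows. It assumes Sakurai's slope expression in its commonly quoted form, S_T = 1/2 − (1 − V_T/V_DD)/(1 + α), and the ramp delay D_T = D + S_T·T described above; the function names are hypothetical.

```python
def slope_fraction(alpha, vt, vdd):
    """Fraction of the input transition time that is added to the step delay."""
    return 0.5 - (1.0 - vt / vdd) / (1.0 + alpha)


def ramp_delay(step_delay, t_input, alpha, vt, vdd):
    """Ramp delay D_T = D + S_T * T, in the same time unit as the inputs."""
    return step_delay + slope_fraction(alpha, vt, vdd) * t_input


# Example: the four alpha values quoted in the Figure 6.6 caption, VDD = 3 V.
for name, alpha in (("NMOS falling", 1.2), ("PMOS rising", 1.35),
                    ("NMOS rising", 2.0), ("PMOS falling", 2.0)):
    s_t = slope_fraction(alpha, vt=0.66, vdd=3.0)
    print(f"{name}: S_T = {s_t:.3f}")
```

For α = 2 the expression reduces to (1 + 2 V_T/V_DD)/6, which is the long-channel form the text compares it with. The threshold of 0.66 V is the NMOS value and is reused for the PMOS rows only for illustration.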
Figure 6.6: A comparison between the results of HSPICE simulations and the delay model
for the four delay types: NMOS Falling (α = 1.2), PMOS Rising (α = 1.35), NMOS Rising
(α = 2), and PMOS Falling (α = 2).
[Four panels: NMOS Falling Delay and NMOS Rising Delay versus input rise time (ps);
PMOS Rising Delay and PMOS Falling Delay versus input fall time (ps).]
For delay calculations of cascaded gates, one should relate the input transition time T of
a gate to the step delay of the preceding, i.e. driving, gate. Although [76] has suggested a
formula for this purpose, our simulations show that a step delay is approximately equivalent
to half of the corresponding transition time. Hence, the delay of a gate i can be expressed
in terms of its own step delay Di and the step delay of its driving gate Di−1 as follows.
$$D_{T,i} = D_i + S D_{i-1} \qquad (6.8)$$
where S = 2ST is called the slope factor. We define a slope factor for each of the four delay
types.
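The cascaded-gate rule can be sketched as a running sum. This assumes the reading that the ramp delay of gate i is D_i + S_i·D_(i−1) with S_i = 2·S_T,i, since a step delay is taken as half of the corresponding transition time; `path_delay` and its arguments are hypothetical names.

```python
def path_delay(step_delays, slope_factors, first_input_time=0.0):
    """Sum the ramp delays D_i + S_i * D_(i-1) along a chain of gates.

    step_delays[i]   -- step delay of gate i
    slope_factors[i] -- slope factor S of gate i (twice the slope fraction S_T)
    first_input_time -- transition time at the input of the first gate
    """
    total = 0.0
    previous = first_input_time / 2.0  # a step delay counts as half a transition time
    for d, s in zip(step_delays, slope_factors):
        total += d + s * previous
        previous = d
    return total


# Two gates driven by an ideal step: only the second gate sees a sloped input.
print(path_delay([100.0, 200.0], [0.3, 0.4]))
```

Because each term only needs the step delay of the previous stage, a whole critical path can be evaluated from step delays alone.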
6.6 Overlapping and Opposing Currents 
In this section we discuss modifications of the delay expression for two cases which are
common in digital CMOS circuits, namely, the cases of overlapping and opposing currents.
A prominent example of the former is a CMOS transmission gate, which is widely used in
memory structures and multiplexers. Examples of the latter include level restorers, latches,
and DCVSL gates.
Figure 6.7: Variations of the step and ramp (T = 1 ns) delays of a CMOS transmission gate
(Wn = 10 µm and Wp = 20 µm) discharging an output load (C = 1 pF).
[Horizontal axis: VDD, power supply voltage (V).]
Let's consider a CMOS transmission gate with a PMOS transistor of width Wp, an NMOS
transistor of width Wn, and an output load C. We develop an expression for the falling delay.
A similar argument can be made for the rising delay. Since both transistors cooperate to
discharge the output, the step delay can be simply written as
Next, we consider the effect of the input signal slope. Predicting the exact effect of the input
slope is very complicated, because in practice the arrival time and the slope of each input
signal may be different from the other. We can, however, make reasonable assumptions. For
all the cases when the difference between the arrival times of the two input signals is much
smaller than the delay, we may assume that both transistors are activated at the same time
with a similar rise and fall time T. This is the ideal case of operation of a transmission gate.
At the other extreme, for all the cases when the input signals are far apart relative to the
delay, we may assume that only one of the transistors has contributed to the delay. This
situation is avoided, because it defeats the purpose of having both a PMOS transistor and an
NMOS transistor in the circuit. For the ideal case, based on intuition, we propose that the
slope factor S of each transistor should be weighted by its share of the total current. Hence,
the overall slope factor is approximated by
Thus, the value of the total slope factor is between the values of the individual slope factors.
Note that this is valid independent of the formulation used for calculating the individual
slope factors. Therefore, the total falling delay is given by
Figure 6.7 illustrates the close match achieved between this model and the results of HSPICE
simulations for both step delay and ramp delay with T = 1 ns at different supply voltages.
Now we turn to the case of opposing currents. In this case, a network of one or more
transistors tries to charge (discharge) a node while, simultaneously, the node is being
discharged (charged) by, usually, a single transistor. The opposing transistor is often already
active when the network is activated by an arriving input signal. For example, assume a
capacitance C which is already charged by a PMOS transistor of width Wp and is about
to discharge through an NMOS transistor of width Wn while the PMOS transistor remains
active. For the step delay, we simply subtract the two currents and obtain
For the ramp delay, we take a similar approach as in the case of the transmission gate. Since
the input is only applied to the NMOS transistor, we weigh its slope factor by its share of
the total current, and calculate the total slope factor by the following expression. The total
delay can, then, be calculated using (6.12).
Therefore, the slope factor for the delay through a transistor influenced by opposing currents
is higher than that of the same transistor operating free of such an influence.
Figure 6.8 shows the match achieved between this model and the results of HSPICE simulations
for both step delay and ramp delay with T = 1 ns at different supply voltages.
Figure 6.8: Variations of the step and ramp (T = 1 ns) falling delays of a CMOS structure
involving opposing currents (Wn = 10 µm, Wp = 20 µm, and C = 1 pF).
[Legend: model; HSPICE simulation. Horizontal axis: VDD, power supply voltage (V).]
The technique described in these two examples can be applied to a variety of situations
involving two types of currents. The important point is the ability to accurately predict
the amount of each current. This is facilitated by using the current model presented in the
previous sections.
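The two cases can be sketched side by side. The code below is one reading of the verbal description, not a transcription of the original expressions: for overlapping currents the two saturation currents add and the slope factors are weighted by each transistor's share of the total current; for opposing currents the currents subtract and the NMOS slope factor is scaled by its current over the net current. All function names are hypothetical.

```python
def combined_step_delay(c, vdd, i_main, i_other, opposing=False):
    """Step delay (C*VDD/2)/I_net with I_net = I_main + I_other, or the difference."""
    i_net = i_main - i_other if opposing else i_main + i_other
    if i_net <= 0.0:
        raise ValueError("net current must be positive for the node to switch")
    return c * vdd / 2.0 / i_net


def overlapping_slope_factor(s_n, i_n, s_p, i_p):
    """Current-weighted average of the two individual slope factors."""
    return (s_n * i_n + s_p * i_p) / (i_n + i_p)


def opposing_slope_factor(s_n, i_n, i_p):
    """NMOS slope factor weighted by its share of the net (I_n - I_p) current."""
    return s_n * i_n / (i_n - i_p)


# Illustrative numbers only (currents in mA, C in fF, result in ps).
i_n, i_p, s_n, s_p = 3.0, 1.0, 0.3, 0.5
print(combined_step_delay(1000.0, 3.0, i_n, i_p))                 # transmission gate
print(combined_step_delay(1000.0, 3.0, i_n, i_p, opposing=True))  # opposing PMOS
print(overlapping_slope_factor(s_n, i_n, s_p, i_p))
print(opposing_slope_factor(s_n, i_n, i_p))
```

Under this reading the overlapping slope factor always lies between the two individual factors, and the opposing slope factor always exceeds the unopposed one, matching the two remarks in the text.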
6.7 MOS Transistors Connected in Series 
One of the complications involved in modeling transistor circuits is delay estimation through
a chain of series-connected devices. In such cases, usually, an equivalent RC network is used
for modeling the delay. We assume a uniform size for the series-connected transistors in a
gate. Another choice is the so-called graduated [89] or progressive [72] sizing of cascaded
transistors. An analytical procedure for progressive MOSFET sizing is presented in [8] and
used in [100]. Progressive transistor sizing seems particularly useful in designing dynamic
CMOS logic gates [89,100,104]. For today's short-channel devices, however, the literature
suggests only marginal improvement in the delay by using progressive transistor sizing over
uniform transistor sizing. In [45], it is reported that for a 3-input NAND gate, the graduated
version shows less than 4% and 2% delay improvements over the uniform version in
1.5 µm and 1 µm technologies, respectively. Hence, this delay improvement should be even
less for deep submicron technologies. Besides, progressive transistor sizing increases the area
and energy consumption of a gate.
Consider the RC network of Figure 6.9. Assume that initially all nodes are precharged
to VDD. In 1948, Elmore [35] showed that the voltage drop at node n, after a zero-going step
is applied at the input, is governed by the following dominating time constant.
Rubinstein, Penfield, and Horowitz generalized Elmore's theory for an RC tree. They also
provided precise lower and upper bounds on the voltage waveforms in an RC tree [46,74].
Many computer-aided timing analyzers have incorporated these theories.
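For reference, Elmore's time constant for a simple RC chain can be computed as below. The sketch uses the standard textbook form, in which each capacitance is multiplied by the total resistance between it and the input; the indexing convention and the function name are assumptions, since Figure 6.9 is not reproduced here.

```python
def elmore_time_constant(resistances, capacitances):
    """Elmore time constant at the last node of an RC chain.

    resistances[k]  -- resistor between node k and node k+1 (node 0 is the input)
    capacitances[k] -- capacitor from node k+1 to ground
    """
    tau = 0.0
    r_path = 0.0
    for r, c in zip(resistances, capacitances):
        r_path += r        # resistance accumulated from the input to this node
        tau += r_path * c  # each capacitor sees all the resistance upstream of it
    return tau


# Three equal sections: tau = R*C * (1 + 2 + 3)
print(elmore_time_constant([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))
```

The quadratic growth with the number of sections is what makes long series chains slow in the long-channel RC view.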
Figure 6.9: A general RC chain.
Elmore's approximation is very useful, provided that the values of all resistances and
capacitances are known. A major obstacle, however, is the determination of these values,
especially those of the resistances. Sakurai and Newton have shown that short-channel
devices do not obey the RC models used to predict the delay in series-connected long-channel
MOSFETs [77]. In a pull-up or pull-down chain of transistors, usually the transistor closest
to the output operates in the saturation mode and the rest operate in the linear mode [77].
We use this assumption, combined with Elmore's theory, to simplify the delay modeling.
This is reflected in Figure 6.10, where a chain of series-connected transistors is replaced by
an RC chain, and Rl stands for the equivalent resistance of a transistor in the linear mode,
R stands for the equivalent resistance of a transistor in the saturation mode, C stands for
the diffusion capacitance per transistor, and CL stands for the load capacitance. Applying
Elmore's approximation gives
Figure 6.10: An equivalent RC chain for series-connected MOS transistors.
which results in the following, if CL is much larger than C.
$$T = R \, [1 + \gamma (N - 1)] \, C_L \qquad (6.18)$$
where γ = Rl/R. On the other hand, for very large CL, simulations show that the delay of N
series-connected transistors is proportional to the delay of a single transistor with the same
load through the voltage-dependent delay degradation factor Y.
$$Y = 1 + (N - 1)(a - b V_{DD}) \qquad (6.19)$$
where a and b are empirical parameters with the typical values of 1.2 and 0.1, respectively.
Comparing (6.19) with (6.18) shows that
$$\gamma = a - b V_{DD} \qquad (6.20)$$
Replacing Rl with γR in (6.17) yields
T = R (XC + YCL) (6.21)
where X = γ(N² − 1) + 1 and Y = γ(N − 1) + 1. Hence, using our formulation, we may
express the delay through N series-connected transistors by
The values of a and b are obtained through simulations for each case. When CL dominates,
X can be set to zero. For fast hand-analysis, Y = N is a good approximation.
Before concluding this section, we should mention an important point deduced from
(6.22). Note that the diffusion capacitance C is proportional to the transistor width W.
Hence, the first term in (6.22), that is, the term related to the internal nodes, is a constant
and does not depend on W. Therefore, in pull-up and pull-down transistor chains, only
the load capacitance determines the optimal transistor sizing for minimizing the delay, and
the internal nodes can be ignored in that regard. We rely on this finding in dealing with
optimization of CMOS logic circuits in the next chapter.
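The degradation factor and the width-independence argument can be checked numerically. The sketch assumes Y = 1 + (N − 1)(a − b·V_DD) and, since expression (6.22) itself is not legible in the source, a delay of the form (X·C + Y·C_L)·(V_DD/2)/I_D with C = d·W and I_D = κ·W·(V_DD − V_T)^(ξ + θ/V_DD). The value of X, the diffusion capacitance per unit width, and the function names are illustrative assumptions.

```python
def degradation_factor(n, vdd, a=1.2, b=0.1):
    """Y = 1 + (N - 1)(a - b*VDD) for N series-connected transistors."""
    return 1.0 + (n - 1) * (a - b * vdd)


def internal_node_term(w, x, d_per_um, kappa, vdd, vt, xi, theta):
    """X*C*(VDD/2)/I_D with C = d*W and I_D proportional to W (ASSUMED form)."""
    c_int = d_per_um * w
    i_d = kappa * w * (vdd - vt) ** (xi + theta / vdd)
    return x * c_int * vdd / 2.0 / i_d


# For N = 3 at VDD = 3 V the factor is close to N, the hand-analysis value.
print(degradation_factor(3, 3.0))

# The internal-node term is the same for two different widths.
args = dict(x=2.0, d_per_um=1.75, kappa=0.12065, vdd=3.0, vt=0.66,
            xi=1.1265, theta=0.4061)
print(internal_node_term(10.0, **args), internal_node_term(20.0, **args))
```

Because W appears in both the capacitance and the current, it cancels in the internal-node term, which is why only the load term matters for sizing.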
6.8 Extraction of Delay Parameters
To characterize the delays in a particular technology, we should extract four sets of
parameters corresponding to four types of currents or delays. The symbols for these parameters
and their values for our technology are listed in Table 6.1 and Table 6.2.
Table 6.1: Current Model Parameters, Symbols, and Values for a 0.5 µm CMOS Technology.
| Delay Type   | κ (mA/µm/V^(ξ+θ/VDD)) | ξ      | θ (V)  |
| NMOS Falling | 0.12065               | 1.1265 | 0.4061 |
| PMOS Rising  | 0.04564               | 1.3765 | 0.3756 |
| NMOS Rising  | 0.01408               | 1.3203 | 4.6958 |
| PMOS Falling | 0.00295               | 1.5065 | 6.8598 |
Table 6.2: Delay Degradation Parameters, Symbols, and Values for a 0.5 µm CMOS Technology.
| Delay Type   | α    | a    | b    |
| NMOS Falling | 1.20 | 1.12 | 0.10 |
| PMOS Rising  | 1.35 | 1.45 | 0.14 |
| NMOS Rising  | 2.00 | 1.25 | 0.09 |
| PMOS Falling | 2.00 | 1.45 | 0.09 |
The following steps summarize the parameter extraction procedure. These steps have to
be repeated for each of the four delay cases.
1. Select a relatively large output capacitance C (e.g. 1 pF) and a reasonable transistor
size W (e.g. 10 µm).
2. Measure the step delays D1, D2, and D3 at three different supply voltage settings V1,
V2, and V3, respectively.
3. Calculate ξ and θ from the following set of equations.
$$\xi \log\frac{V_2 - V_T}{V_1 - V_T} + \theta \log\frac{(V_2 - V_T)^{1/V_2}}{(V_1 - V_T)^{1/V_1}} = \log\frac{D_1}{D_2} - \log\frac{V_1}{V_2}$$
$$\xi \log\frac{V_3 - V_T}{V_2 - V_T} + \theta \log\frac{(V_3 - V_T)^{1/V_3}}{(V_2 - V_T)^{1/V_2}} = \log\frac{D_2}{D_3} - \log\frac{V_2}{V_3}$$
4. Calculate κ from
5. At VDD = V2 apply a ramp signal with rise time or fall time T (e.g. 1 ns) and measure
the delay DT. Calculate α from
6. At V1 measure the step delay D11 due to two identical transistors of width W connected
in series to C. Similarly, at V2 measure the step delay D22 due to two identical
transistors of width W connected in series to C. Calculate a and b from the following
set of equations.
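The extraction steps can be sketched end to end. Several pieces are assumptions rather than the original formulas: the step delay is taken as D = (C·V_DD/2)/I_D with I_D = κ·W·(V_DD − V_T)^(ξ + θ/V_DD); α is recovered by inverting S_T = 1/2 − (1 − V_T/V_DD)/(1 + α); and a and b come from D11/D1 = 1 + a − b·V1 and D22/D2 = 1 + a − b·V2 (the N = 2 case of the degradation factor). The "measurements" are synthetic, generated from known parameters so that the recovery can be checked; all names are hypothetical.

```python
import math


def step_delay(c, w, vdd, vt, kappa, xi, theta):
    """D = (C*VDD/2)/I_D; ps when C is in fF and kappa is in mA/um."""
    return c * vdd / 2.0 / (kappa * w * (vdd - vt) ** (xi + theta / vdd))


def extract_current_params(c, w, vt, volts, delays):
    """Steps 2 to 4: solve two log-ratio equations for xi and theta, then kappa."""
    (v1, v2, v3), (d1, d2, d3) = volts, delays

    def row(va, vb, da, db):
        coeff_xi = math.log((vb - vt) / (va - vt))
        coeff_theta = math.log(vb - vt) / vb - math.log(va - vt) / va
        rhs = math.log(da / db) - math.log(va / vb)
        return coeff_xi, coeff_theta, rhs

    a1, b1, r1 = row(v1, v2, d1, d2)
    a2, b2, r2 = row(v2, v3, d2, d3)
    det = a1 * b2 - a2 * b1                      # Cramer's rule for the 2x2 system
    xi = (r1 * b2 - r2 * b1) / det
    theta = (a1 * r2 - a2 * r1) / det
    kappa = c * v2 / (2.0 * d2 * w * (v2 - vt) ** (xi + theta / v2))
    return kappa, xi, theta


def extract_alpha(d_step, d_ramp, t_input, vt, vdd):
    """Step 5: invert the slope expression for alpha."""
    s_t = (d_ramp - d_step) / t_input
    return (1.0 - vt / vdd) / (0.5 - s_t) - 1.0


def extract_a_b(d1, d11, v1, d2, d22, v2):
    """Step 6: two-transistor stacks at two supply voltages give a and b."""
    g1, g2 = d11 / d1 - 1.0, d22 / d2 - 1.0
    b = (g1 - g2) / (v2 - v1)
    return g1 + b * v1, b


# Synthetic measurements from known values (C = 1 pF, W = 10 um, VT = 0.66 V).
C, W, VT = 1000.0, 10.0, 0.66
TRUE = dict(kappa=0.12065, xi=1.1265, theta=0.4061)
VOLTS = (1.0, 2.0, 3.5)
DELAYS = tuple(step_delay(C, W, v, VT, **TRUE) for v in VOLTS)

kappa, xi, theta = extract_current_params(C, W, VT, VOLTS, DELAYS)
d_ramp = DELAYS[1] + (0.5 - (1.0 - VT / 2.0) / (1.0 + 1.2)) * 1000.0
alpha = extract_alpha(DELAYS[1], d_ramp, 1000.0, VT, 2.0)
a, b = extract_a_b(DELAYS[0], DELAYS[0] * (1 + 1.12 - 0.10 * 1.0), 1.0,
                   DELAYS[1], DELAYS[1] * (1 + 1.12 - 0.10 * 2.0), 2.0)
print(kappa, xi, theta, alpha, a, b)
```

With real simulation data the three fitting voltages should be spread over the supply range of interest, as in the 1 V, 2 V, and 3.5 V points used earlier in the chapter.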
6.9 Extraction of MOSFET Capacitances 
The gate and diffusion capacitances in MOS devices are both voltage dependent and, thus,
have different average values during the rising delay and the falling delay. We use a separate
gate capacitance per unit width, g, for each of the PMOS rising delay, PMOS falling delay,
NMOS rising delay, and NMOS falling delay. Similarly, a separate diffusion capacitance per
unit width, d, is used for each of these cases.
The values of these parameters for our technology are listed in Table 6.3.
Table 6.3: Gate and Diffusion Capacitances per Unit Width: Symbols and Values for a
0.5 µm CMOS Technology.
| Transistor / Delay | g    | d    |
| PMOS / Falling     | 1.35 | 3.00 |
| PMOS / Rising      | 2.25 | 2.50 |
| NMOS / Rising      | 1.55 | 1.75 |
It is important to note that for energy calculations the average gate and diffusion
capacitances over one switching cycle should be used. The average PMOS and NMOS gate
capacitances per unit width are given by
respectively. Similarly, the average PMOS and NMOS diffusion capacitances per unit width
are given by
respectively.
For quick estimation of the delay and energy, one may use the following values for a
transistor gate capacitance per unit width and a transistor diffusion capacitance per unit
width, respectively.
This is the approximation used in the formulation of Chapter 3. For the 0.5 µm CMOS
technology, according to Table 6.3, g = 1.85 fF/µm and d = 2.13 fF/µm. For example, if
the load of a logic gate entirely consists of gates of NMOS transistors, this approximation
gives about 20% error in estimating the rising step delay due to the load. Therefore, the
approximation is particularly unsuitable for PTL circuits.
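The 20% figure can be reproduced with one line of arithmetic, assuming the NMOS rising gate capacitance is the 1.55 fF/µm entry read from Table 6.3 (the table is damaged in the source, so this pairing is an assumption).

```python
G_QUICK = 1.85        # fF/um, quick-estimate gate capacitance from the text
G_NMOS_RISING = 1.55  # fF/um, NMOS gate capacitance for a rising delay (ASSUMED reading)

# The delay due to the load scales with the load capacitance, so the relative
# error of the quick estimate equals the relative error in capacitance.
relative_error = (G_QUICK - G_NMOS_RISING) / G_NMOS_RISING
print(f"over-estimate of the load term: {relative_error:.1%}")
```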
Consider the inverter chain of Figure 6.11. The rising and falling propagation delays
between the input and the output of the cell are given by
(6.27)
The following procedure may be used to extract the values of the MOS capacitances.
Figure 6.11: A CMOS inverter chain.
1. Initially, let Wp = Wdp = Wlp = W1p (= 20 µm) and Wn = Wdn = Wln = W1n (=
10 µm). Measure the rising delay and the falling delay (both labelled with subscript 1).
2. Let Wlp = W2p (= 30 µm). Measure the rising and falling delays (subscript 2). Then,
calculate the following.
3. Reset all transistor sizes back to Step 1. Let Wln = W3n (= 20 µm). Measure the rising
and falling delays (subscript 3). Then, calculate the following.
4. Reset all transistor sizes back to Step 1. Let Wp = W4p (= 30 µm). Measure the delay
(subscript 4). Then, calculate the following.
5. Reset all transistor sizes back to Step 1. Let Wn = W5n (= 20 µm). Measure the delay
(subscript 5). Then, calculate the following.
6. Reset all transistor sizes back to Step 1. Only two parameters remain unknown.
Change the width ratio of PMOS transistor to NMOS transistor in the inverters
from 2, as in Step 1, to 1.5. Measure the rising and falling delays (subscript 6). Express
the differences between the delays of Step 1 and those of this step.
Solve these two equations for the two unknowns.
A diffusion junction capacitance is a non-linear function of the voltage across the junction.
It is possible to obtain an average value for Cj over the voltage range of interest, V1 to V2,
using the following.
where Cj0 is the value of the capacitance at zero potential, φ0 is the junction's built-in
potential, and m is the grading coefficient. The above expression yields:
For energy calculations we are interested in the value of the junction capacitance over the
range of full voltage swing.
For delay calculations, however, we are interested in the value of the junction capacitance
over the range of half voltage swing. Hence, we need to have two expressions: one for the
rising delay
and one for the falling delay
We presented a method of extracting the diffusion capacitances for a reference supply
voltage through simulations. Comparing the calculated and extracted values for the reference
VDD (3 V in our case) gives us an adjustment factor. For a different VDD, first we calculate
the diffusion capacitances using the above formulas. Then, we multiply the calculated results
by the adjustment factor to obtain the values of the diffusion capacitances for use in our
model.
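The averaging of a junction capacitance over a voltage range can be sketched with the standard textbook closed form. The code assumes C_j(V_r) = C_j0/(1 + V_r/φ_0)^m, written in terms of the reverse-bias magnitude V_r; the values of C_j0, φ_0, and m below are illustrative, not taken from the thesis. For a diffusion whose bulk is grounded, a rising half swing is taken to cover V_r from 0 to V_DD/2 and a falling half swing from V_DD/2 to V_DD.

```python
def junction_cap(v_rev, cj0, phi0, m):
    """Junction capacitance at reverse bias v_rev (>= 0)."""
    return cj0 / (1.0 + v_rev / phi0) ** m


def average_junction_cap(v1, v2, cj0, phi0, m):
    """Average of junction_cap over reverse bias v1..v2 (requires m != 1, v1 != v2)."""
    k = 1.0 - m
    span = (1.0 + v2 / phi0) ** k - (1.0 + v1 / phi0) ** k
    return cj0 * phi0 / ((v2 - v1) * k) * span


# Illustrative values (ASSUMED): Cj0 normalized to 1, phi0 = 0.8 V, m = 0.5.
CJ0, PHI0, M, VDD = 1.0, 0.8, 0.5, 3.0
c_full = average_junction_cap(0.0, VDD, CJ0, PHI0, M)        # energy calculations
c_rise = average_junction_cap(0.0, VDD / 2, CJ0, PHI0, M)    # rising half swing
c_fall = average_junction_cap(VDD / 2, VDD, CJ0, PHI0, M)    # falling half swing
print(c_full, c_rise, c_fall)
```

The half-swing averages differ because the capacitance drops as the reverse bias grows, which is why separate rising and falling values are needed for delay calculations.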
6.10 Concluding Remarks 
In this chapter, we have presented a discussion on the components of a new, unified delay
model for CMOS circuits, which is applicable to pull-up and pull-down chains of both PMOS
and NMOS transistors. This is especially useful in pass-transistor circuits, where NMOS
transistors usually are used to transfer a logic 0 as well as a logic 1. The delay components
include the current, the gate and diffusion capacitances, the input slope factor, and the
series-connection delay degradation factor. We have also offered procedures for extracting
the values of these components through circuit-level simulations.
Chapter 7 
Modeling and Optimization 
of CMOS Logic Styles 
A method of enhanhg performance and saving energy in digital CMOS circuits is to use 
a combination of conventional and non-conventionai logic styles, such as DCVSL and PTL, 
when appropriate. Unfortunately, the literature has little to Say about delay and energy 
calculations in unconventional CMOS styles in general. This chapter applies the d e d  delay 
mode1 of the previous chapter to CMOS gates implemented in the conventional, DCVSL, and 
PTL styles. It also presents closed-form formulas for optimal transistor sizing in each style. 
These f o d a s  are simplified furthes to rules of thamb for quick optimization of CMOS logic 
circtlits. 
Since the delay of a gate is eharacterized by its driving gate and its load, we write the 
delay expression for three cascaded gates. W e  refk to the gate at the middle as the c d .  The 
advantage of expressing the delay for three cascaded gates is that one can obtain closed-form 
formula for optimum transistor sizing of the c d  with respect to its drive and its load. These 
formulas can then be used to optimize a critical path of an arbitrary number of gates. For 
convenience, we ded with the gates at an abstract level. That is, we concern ourselvea only 
with the critical path in a gate and assume that ail series-connected transistors dong that 
path have the same size. 
The foxmulation of Chapter 3 neglects the signal slope &ect, has a long-chamel view of 
CHAPTER 7. MODELING AND OPTIMIZATION OF CMOS LOGIC STYLES 122 
series-connected transistors, assumes a uniform PMOS to NMOS size ratio for all logic gates,
uses a uniform gate capacitance value per unit width for PMOS and NMOS transistors, uses
a uniform diffusion capacitance value per unit width for PMOS and NMOS transistors, uses
similar capacitance values for rising and falling transitions, and, most importantly, is confined
to the conventional CMOS logic style. In this chapter, we remove all of these limitations.
The previous chapter showed that the internal node capacitances in a chain of similarly
sized, series-connected transistors do not affect optimal transistor sizing. Hence, for
simplification, the delay expressions in this chapter, which are used to derive the optimal
transistor sizing formulas, do not include those capacitances. Their contributions to the delay,
however, have been taken into account for producing the delay curves, as explained in the
previous chapter.
We use the familiar XOR gate as an example to verify our technique. The XOR gate,
known as the MERGE in delay-insensitive circuits [34], is widely used in asynchronous design.
This chapter also includes a comparison between various CMOS implementations of the
XOR gate. We assume that the XOR gate is operating in an asynchronous environment as a
MERGE. This assumption imposes the following.
• Gate optimization concerns the total delay over one switching cycle.
• One input may change at a time, followed by an output transition.
7.1 Conventional CMOS Style 
The most widely practiced CMOS logic style is the conventional one, which is well introduced
in textbooks such as [51,72,102]. The conventional CMOS logic style usually offers a good
trade-off between performance and energy consumption.
Figure 7.1 depicts the schematic of a CMOS cell implemented in the conventional pull-up
pull-down style with its drive and its load. The input of the drive is assumed to be a step
voltage function. Capacitance $C_d$ includes the interconnect capacitances and any other gate
and diffusion capacitances at the output of the drive. Similarly, capacitance $C_l$ includes
the interconnect capacitances and any other gate and diffusion capacitances at the output
of the cell. Hence, the values of $C_d$ and $C_l$ may differ between a rising delay and
a falling delay. The critical path in the pull-up section of the cell consists of $n_p$ PMOS
transistors of which, usually, the closest one to $V_{DD}$ is driven by the drive. This scenario
produces the worst-case delay if the transition time of the input signal to the cell is not very
large [77]. Each of the series-connected transistors in the critical path has width $W_p$. Due
to symmetry, the drive may also be connected to some other transistors of the same width
within the pull-up section of the cell. Let $m_p$ indicate the number of these transistors. At
the output of the cell, again due to symmetry, there are $q_p$ drain diffusions of width $W_p$.
The pull-down section of the cell can be described in a similar fashion by replacing $W_p$ with
$W_n$, $n_p$ with $n_n$, $m_p$ with $m_n$, and $q_p$ with $q_n$. Each pull-up or pull-down section in the
drive or the load is also characterized by some W, n, m, and q. Therefore, we are considering a
general case.

Figure 7.1: A conventional CMOS cell between a driving gate and output load.
As an example, consider the carry generation circuit shown in Figure 7.2 [72]. Assume
that signal b is along the critical path. Considering the pull-up section, for instance,
symmetry may force one to optimize both branches involving b simultaneously. Hence, $m_p = 2$,
$n_p = 2$, and $q_p = 2$. Symmetry usually has to be observed in designing cells for a library of
logic gates. In most other circumstances, however, only one of the branches determines the
worst-case delay. In such cases, the symmetry is void and $m_p = 1$, $n_p = 2$, and $q_p = 1$.
Figure 7.2: Carry generation circuit in a mirror adder. 
Turning back to Figure 7.1, let us combine all capacitances at the output of the drive
except those related to $W_p$ into one component $C_D$ as follows.
Similarly, let $C_L$ represent all capacitances at the output of the cell except those related to
$W_p$.
The rising delay from the input of the drive to the output of the cell can be expressed as
where bc and bD are the step delays of the cell and the drive, respectively.
By setting the derivative of the rising delay with respect to $W_p$ to zero, one may obtain the
optimum $W_p$ for minimizing the rising delay.
Rearranging the terms in the above expression leads to
The term at the left is the falling delay at the input of the cell due to only $W_p$, and the term
at the right is the rising delay at the output of the cell through $W_p$ due to the load excluding
$W_p$. Thus, the optimum $W_p$ for rising delay must satisfy the following condition.
Input falling delay due to $W_p$ as load = Output rising delay through $W_p$ excluding its own load
This expression may be modified into a better form by observing the addition of the term
to both sides of (7.9). By doing so, the right hand side will become the total delay in
producing a rising output transition due to $W_p$ as load and the left hand side will become
the total delay in producing that output through $W_p$ as a drive. Hence, the optimum sizing
condition may be restated in this form.
Delay due to $W_p$ as load = Delay through $W_p$ as drive
Now we try to impose some approximations to simplify (7.8). Note that, in general, $k_p$
depends on $W_p$. If the output load is much larger than the diffusion capacitance related to
$W_p$, then $q_p W_p d_p$ may be ignored in favour of $C_L$ in (7.8). By using this approximation
and replacing $\gamma_n/\gamma_p$ with $\Lambda$, (7.8) takes the following form (footnote 1).
We need to perform further approximations to obtain a simple rule of thumb for circuit
designers in optimizing the rising delay. A circuit designer does not know the exact value of
$C_L$ before examining the layout. However, since the fanout gates are known, $C_L$ may be
replaced by the total width of the fanout $W_{Lt}$ times the average gate capacitance per unit
width g. Using this approximation for $C_L$, n,, nD,, and neglecting the input
slope effect in (7.10) yields the following rule of thumb for optimizing the rise time.

Footnote 1: $\Lambda$ is equivalent to p defined in Chapter 3.
An advantage of (7.11) is that its parameters are physical and technology independent,
except $\Lambda$. For $\Lambda$, there is evidence that in future deep submicron technologies it has a
constant value of around 2.2 [13,14].
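
The load approximation described above (fanout width times gate capacitance per unit width) can be illustrated with a short numerical sketch. The fanout widths and the value of g below are hypothetical and are not taken from the thesis; only the structure of the estimate follows the text.

```python
def estimate_load_capacitance(fanout_widths_um, gate_cap_per_um_fF):
    """Approximate C_L as total fanout gate width times average gate capacitance per unit width."""
    total_fanout_width_um = sum(fanout_widths_um)  # plays the role of W_Lt in the text
    return total_fanout_width_um * gate_cap_per_um_fF


# Hypothetical fanout: two gates, each presenting one PMOS and one NMOS gate to the cell.
fanout_widths_um = [20.0, 10.0, 20.0, 10.0]
g_fF_per_um = 2.0  # assumed average gate capacitance per unit width (not from the thesis)

c_load_fF = estimate_load_capacitance(fanout_widths_um, g_fF_per_um)
print(f"Estimated C_L = {c_load_fF:.1f} fF")
```

Such an estimate is only a pre-layout stand-in for $C_L$; interconnect capacitance is ignored in this sketch.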
Similar to $W_p$, there is an optimum $W_n$ for the falling delay. The falling delay
from the input of the drive to the output of the cell can be expressed as
where
The optimum $W_n$ for the falling delay is then given by the following exact and approximated
formulas. In the approximated form, the output load is assumed to be much larger than the
diffusion capacitance related to $W_p$.
By some rearrangements the following condition now appears.
Input rising delay due to $W_n$ as load = Output falling delay through $W_n$ excluding its own load
This may be modified to
Delay due to $W_n$ as load = Delay through $W_n$ as drive
Applying the same approximations as for $W_p$ results in the following rule of thumb for
minimizing the falling delay.
Note that there are no optimum $W_p$ and $W_n$ for the falling and the rising delays,
respectively. Summing the rising and falling delays gives the total delay.
From the total delay we can obtain a set of equations for the optimum sizing by setting
$\partial D/\partial W_p = \partial D/\partial W_n = 0$. However, we may as well use a short-cut method by
extrapolating the rules obtained in the optimization of the rising and falling delays. We speculate
that optimized values of $W_p$ and $W_n$ for the total delay would satisfy the following set of
conditions.
Total delay due to $W_p$ as load = Total delay through $W_p$
Total delay due to $W_n$ as load = Total delay through $W_n$
Consider the first condition. If the similar terms are cancelled out from both sides of the
equality, the left hand side has three components: one component due to the gate capacitance
during the input rising transition, one due to the gate capacitance during the input falling
transition, and one due to the diffusion capacitance during the output falling transition. The
right hand side has only one component, since $W_p$ is only active during the output rising
transition. Writing these components in the same order, the above speculation translates
into
Similarly, the second condition for the optimum $W_n$ translates into
Combining the above two conditions gives the following set of formulas for simultaneous
optimization of the pull-up and pull-down sections of the cell.
Solving $\partial D/\partial W_p = \partial D/\partial W_n = 0$ confirms our speculations, as it leads exactly to the same
results. If the approximated formulas are used, then they are independent and may be solved
separately. Otherwise, the two equations can be solved together by iteration. The value
obtained from the one can be used in the other and so forth until the changes in the values
are very small. Usually it takes a few iterations to get the final results. The initial values
of $W_p$ and $W_n$ for starting the iteration process may be obtained by the approximated set
of formulas. A point worth mentioning is that the optimization is supply voltage dependent
mainly through fi, and V,. Using further approximations, as for the cases of the rising and
falling delays, leads to the following rules of thumb for optimization of the total delay.
Where $\Gamma$ is the optimum width ratio of PMOS to NMOS transistors for minimizing total
delay, given by (7.24) (footnote 2), which equals $\sqrt{\Lambda}$ if the cell is an inverter. This is
an interesting result, because it is independent of the driving gate.

Footnote 2: $\Gamma$ is equivalent to r defined in Chapter 3.

As far as the optimization of conventional CMOS gates is concerned, there is enough
evidence, as demonstrated, to claim the theory shown in Figure 7.3, where the term "due to
the transistor MOS" includes the transistor MOS itself and those related to this transistor by
symmetry, if any, as previously explained. This theory may find applications in developing
CAD optimization tools. It also saves time in quick hand-analysis and optimization by just
writing the relevant delay terms rather than the whole delay expressions.

    Given a conventional CMOS logic gate producing the output signal OUT,
    the size of a transistor MOS along the critical path of the gate is optimal if
    the signal delay in producing OUT due to the transistor MOS as the load
    equals
    the signal delay in producing OUT through the transistor MOS as the drive.

Figure 7.3: Informal claim regarding optimization of conventional CMOS circuits. The delay
may be a rising delay, falling delay, or average delay.
7.1.1 Conventional XOR 
Figure 7.4 depicts the schematic of a conventional XOR gate. Each input comes from a
driving gate, in this case an inverter. The output capacitance $C_L$ includes the loads due to
the fanout and interconnects. One of the driving inverters, one of the pull-up (pull-down)
chains, and $C_L$ constitute the critical path. Here is a typical statement of the problem.
• Given are the following data on the environment.
  1. The driver's $W_{Dp}/W_{Dn} = 20/10$.
  2. The load $C_L$ = 200 fF and $V_{DD}$ = 3 V.
• Find the following.
  1. The optimal transistor sizing of the cell such that the delay over one cycle is
     minimum.
  2. The minimum delay and its corresponding energy dissipation.
  3. The behaviour of the minimum delay and its energy cost as $V_{DD}$ is scaled.
Figure 7.4: Conventional CMOS XOR Implementation. 
This section only answers the first part of the problem and leaves the rest for a later section
when CMOS implementations of the XOR gate are compared. From the schematic we obtain
$m_p = 1$, $m_n = 1$, $n_p = 2$, $n_n = 2$, $q_p = 2$, and $q_n = 2$.
Table 7.1: Optimal transistor sizing for the conventional XOR gate.

| Applied Formula Set           | $W_p$ (µm) | $W_n$ (µm) | $W_p/W_n$ |
| Exact: 7.18 and 7.20          | 41         | 28         | 1.46      |
| Approximation: 7.19 and 7.21  | 44         | 25         | 1.76      |
| Rule of Thumb: 7.22 and 7.23  | 50         | 31         | 1.61      |
Solving the problem calls for applying the optimal transistor sizing formulas of the total
delay. There are three choices: the exact set of formulas (7.18 and 7.20), the approximate
set of formulas (7.19 and 7.21), and finally the rules of thumb (7.22 and 7.23). (The term
exact here implies preciseness with respect to the delay model, which, of course, may be
different from SPICE and the real world.) The results of applying these pairs of formulas are
summarized in Table 7.1. The values obtained from the three formulas are fairly close to
each other. Considering the fact that the delay usually has a broad minimum with respect
to transistor sizing, all three pairs of transistor sizes are acceptable. Actually, HSPICE
simulations show that the difference in the total delay using these pairs of transistor sizes is
less than 2%. Recall, however, that we are dealing with a particular situation and are unable
to generalize the difference in the results of the three formulas. For instance, if $C_L = 0$, the
approximate and the rule-of-thumb formulas both return null answers. Regarding the energy
dissipation, HSPICE simulations show that the transistor sizing pair obtained by the rules
of thumb results in 10% more energy dissipation than the transistor sizing pairs obtained
by the exact and approximate formulas. Figure 7.5 shows the results obtained by the exact
Figure 7.5: Optimal transistor sizing of the conventional XOR gate using the derived formulas
(horizontal axis: iteration number, 1 to 10). The solid lines are obtained with the initial
values of $W_p = W_n = w = 1$ µm. The dashed lines are obtained with the initial values
calculated from the approximated formulas.
formulas (7.18 and 7.20) through iteration. If the minimum transistor width w = 1 µm is
used as the initial value for both $W_p$ and $W_n$, it takes about 10 iterations to establish the
final results. However, the number of required iterations reduces by half if the initial values
are calculated from the approximate formulas (7.19 and 7.21). Note that this is, in fact,
a constrained, non-linear optimization problem. Hence, a natural alternative approach in
obtaining the optimal sizes is to use a non-linear optimization technique. This has been
the standard method in dealing with the problem of gate sizing. In order to confirm our
results and evaluate the efficiency of our iterative technique, we also solved the problem using
MATLAB's constrained optimization package. The problem was defined for MATLAB as
    Minimize    D as given by (7.17)
    Subject to  $W_p \geq w$ and $W_n \geq w$
As Figure 7.6 illustrates, MATLAB gave exactly the same results after about 90 iterations.
That is, about 10 times the number of iterations required by the pair of formulas. Realizing
that each iteration in MATLAB's program takes much longer than merely evaluating a pair




Figure 7.6: Optimal transistor sizing of the conventional XOR gate using the delay models
and optimization package of MATLAB (horizontal axis of both panels: iteration number).
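
The iterative procedure described in this section, in which the value from one formula is substituted into the other until the changes become very small, can be sketched as follows. The two update functions are toy stand-ins chosen only so that the iteration converges; they are not formulas (7.18) and (7.20), and all names and numbers are hypothetical.

```python
def alternate_until_converged(update_wp, update_wn, wp0, wn0, tol=1e-6, max_iter=100):
    """Alternate between two sizing formulas until both widths stop changing."""
    wp, wn = wp0, wn0
    for iteration in range(1, max_iter + 1):
        wp_new = update_wp(wn)        # new W_p from the current W_n
        wn_new = update_wn(wp_new)    # new W_n from the freshly updated W_p
        change = max(abs(wp_new - wp), abs(wn_new - wn))
        wp, wn = wp_new, wn_new
        if change < tol:
            return wp, wn, iteration
    raise RuntimeError("iteration did not converge")


# Toy coupled formulas (assumptions for illustration, not the thesis formulas).
def toy_update_wp(wn):
    return (400.0 + 20.0 * wn) ** 0.5


def toy_update_wn(wp):
    return (200.0 + 10.0 * wp) ** 0.5


# Start once from the minimum width and once from a rough initial guess.
wp_a, wn_a, iters_a = alternate_until_converged(toy_update_wp, toy_update_wn, 1.0, 1.0)
wp_b, wn_b, iters_b = alternate_until_converged(toy_update_wp, toy_update_wn, 40.0, 30.0)
print(f"from minimum width: W_p={wp_a:.2f}, W_n={wn_a:.2f}, iterations={iters_a}")
print(f"from rough guess:   W_p={wp_b:.2f}, W_n={wn_b:.2f}, iterations={iters_b}")
```

Both starting points reach the same pair of widths; how many iterations each one needs depends on the formulas and on the tolerance.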
It is difficult to confirm the optimal sizing results with HSPICE simulations, because
exhaustive simulations would be needed. A partial confirmation may be obtained by fixing,
for example, $W_p$ at $W_p$ = 41 µm and checking whether the optimal $W_n$ found by HSPICE
corresponds to the $W_n$ found by the model. The outcome of this experiment is shown in Figure 7.7,
which includes the average, rising, and falling delays as obtained by HSPICE simulations as
well as the model. One can make a few observations based on this figure.
1. The delay has a broad minimum, such that spotting the exact minimum point on the
   graph is hard. Therefore, we may consider an optimal range of $W_n$ between 24 µm and
   33 µm for which the delay is very close to its minimum.
2. The delay drops sharply as $W_n$ increases from w to the beginning of this optimal range.
   Hence, the process of optimal transistor sizing is, indeed, necessary.
3. As $W_n$ increases beyond the optimal range, however, the delay increases rather slowly.
   Yet, this does not justify using arbitrarily large transistors, because the energy
   dissipation increases linearly as the transistors become larger. Thus, over-sizing the
   transistors is a waste of energy. This is another solid reason in support of optimal
   transistor sizing.
4. The optimum $W_n$ found by the model, $W_n$ = 28 µm, not only is well within the optimal
   range, but also exactly matches the optimum $W_n$ offered by HSPICE simulations. This
   is confirmed by inspecting the delay values obtained by the simulations.
5. Finally, very good agreement is achieved between the results of the model and those
   of HSPICE simulations for the average, rising, and falling delays.
7.2 DCVSL CMOS Style 
DCVSL was first introduced in [42]. Later, [16] described a design procedure for DCVSL
circuits and [17] presented a comparison between conventional and DCVSL full-adder circuits.
DCVSL is especially efficient in designing full-adders. For instance, [73] reports the
design of fast asynchronous full-adder structures implemented in DCVSL. DCVSL gates are
used in asynchronous circuits as completion signal detectors [61]. The operation of DIL gates
is similar to DCVSL gates, except that DIL gates have memory, which is made possible by
two additional minimum size NMOS transistors. Since these two transistors do not interfere
with the switching activities of the gate, the arguments of this section apply to the DIL
style as well.
Figure 7.8 illustrates the schematics of a general DCVSL cell, its drive, and its load. A
DCVSL gate produces the output signal on one side and the complement of the output on
Figure 7.7: Delay estimation for the conventional XOR gate using simulations and the model.
The width of the PMOS transistor is fixed at $W_p$ = 41 µm obtained by the model.
(Horizontal axis: NMOS transistor width $W_n$ in µm.)
the other side. The assumption is that, by symmetry, the loads at both outputs are equal to
$C_L$, which includes the fanout and the interconnect capacitances. The drive, as illustrated
in the figure, may be a conventional CMOS gate or another DCVSL gate. The difference in
the abstract representation of a conventional gate and a DCVSL gate is that in a DCVSL
gate the pull-up section is always characterized by $m_p = 0$, $n_p = 1$, and $q_p = 1$.
Hence, it is more convenient to replace $m_n$ with m, $n_n$ with n, and $q_n$ with q, as shown in
the figure. The pull-down network of a DCVSL gate has two parts, often interconnected,
corresponding to the two outputs. These two parts are related to each other by symmetry
and, thus, optimal transistor sizing of one automatically dictates the transistor sizing in the
other one.
Figure 7.8: A DCVSL CMOS cell between a driving gate and output load.

In DCVSL gates, the rising output is actually produced by the falling output and, hence,
the worst-case delay is always represented by the rising delay. Considering this fact, the
optimization process only concerns the rising delay. The rising and the falling delays between
the input of the drive and the output of the cell are expressed by
respectively. Where bc is the cell's falling step delay and DC is the cell's rising step delay.
Note that bc is not obtained by applying a step rising voltage to the input of the cell,
but rather by applying an imaginary, internal step falling voltage to the gate of the PMOS
transistor producing the rising output. Also note that in the above equations only the rising
step delay of the drive is used, as in this model the falling delay of the drive does not affect
the output delay of the cell. In reality, however, poor falling delay of the drive might prolong
the production of the output signal by generating short-circuit current. The rising delay of
the cell depends on the falling step delay through (1 + 2fp) bC. The falling delay experiences
a case of opposing currents and, hence, its slope factor is calculated according to
Using this term complicates the derivation of the optimal sizing formulas. Since in
functional DCVSL gates the pull-down network must dominate over the pull-up transistor, we
approximate by & in deriving those formulas.
The current for the falling step delay Bc is the current through the pull-down chain
minus the current through the PMOS transistor. During the falling transition, the PMOS
current changes from zero to its saturation value. Therefore, the average PMOS current,
which is considered in delay calculations, equals half of its saturation value. Following are
the expressions for the step delays involved in the total rising and falling delay equations.
q t* Oc = 
Wn f ip - 0.5Wp Yn Y, [( h+ d P ) ~ p + q  Lw. + C L ]  
Setting the partial derivatives of the rising delay with respect to $W_p$ and $W_n$ to zero yields
the following set of formulas for the optimal $W_p$ and $W_n$ minimizing the rising delay.
with
and $\Gamma$ being the optimum width ratio of the PMOS transistor over that of the NMOS
transistor for minimizing the rising delay. The above shows that $\Gamma$ increases as the output
load increases. For DIL implementations, one should add w,( c& + ip) to $C_L$ and
( + ) to in the above formulas to account for the NMOS transistors of
the inverter latch. Here, $w_n$ denotes the smallest allowable width for an NMOS transistor.
Usually, the smallest allowable width for a PMOS transistor, $w_p$, and that for an NMOS
transistor, $w_n$, are equal. Therefore, we can denote both by w.
If the cell's own output capacitances are ignored in favour of the fanout and interconnect
capacitances, the above set of formulas can be approximated to
The rules of thumb are obtained by further approximations as follows.
Where 2/[1 + sqrt(2(1 + s,))] ranges from 0.75 to 0.78 for the typical values of 9, between
0.4 and 0.2, respectively. For a DCVSL inverter, where n = 1, $\Gamma = 0.75\Lambda$, which is about 2 at
$V_{DD}$ = 3 V in our 0.5 µm CMOS technology. As n increases, the pull-down chain becomes
weaker and, hence, $\Gamma$ decreases to maintain the functionality of the DCVSL gate. Note that
because of the race problem that exists in the operation of DCVSL gates, the rules found
for the optimization of conventional CMOS circuits are not directly applicable to DCVSL
circuits.
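
The numerical range quoted for the rule-of-thumb coefficient can be checked directly. In the sketch below the parameter inside the square root is called `x`, a placeholder name used only for this illustration.

```python
import math


def rule_of_thumb_coefficient(x):
    """Evaluate 2 / (1 + sqrt(2 * (1 + x))) for a given parameter value x."""
    return 2.0 / (1.0 + math.sqrt(2.0 * (1.0 + x)))


# Typical parameter values quoted in the text: from 0.4 down to 0.2.
for x in (0.4, 0.3, 0.2):
    print(f"x = {x:.1f}: coefficient = {rule_of_thumb_coefficient(x):.3f}")
```

The coefficient changes by only a few percent over this range, which is consistent with treating it as roughly 0.75 in the inverter case.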
The dynamic energy dissipation in CMOS gates with an output capacitance C is usually
calculated from $E = V_{DD} V_{swing} C$, if short-circuit current and the internal capacitances are
neglected. This formula can be readily applied to the conventional and PTL gate styles.
In DCVSL gates, however, there is an additional component to the energy due to the case
of opposing currents, which cannot be neglected. Therefore, we may express the energy
consumption in a DCVSL gate as the sum of two energies, a useful component used to
charge and discharge the output node and a wasted component.
The useful energy component per cycle is calculated from
The wasted energy component per cycle is governed by
$E_{wasted}$ = 2 × supply voltage × wasted current × duration of wasting energy
Where the wasted current is the current supplied by the PMOS transistor while it is fighting
the output switching. As for the delay model, this current is estimated as half of the PMOS
transistor saturation current. The duration of wasting energy is, in fact, the duration of
the fight, which is estimated as the time from when the gate is activated until the rising output
is produced.
We should mention that the theory of delay optimization, as mentioned for the
conventional logic style, does not precisely apply to the DCVSL style. The reason is the presence of
overlapping currents in DCVSL gates.
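
The wasted-energy estimate stated in words above can be written as a small calculation. This is a sketch under assumed numbers: the saturation current, fight duration, and useful energy below are hypothetical, and only $V_{DD}$ = 3 V comes from the text.

```python
def wasted_energy_per_cycle(v_dd, pmos_sat_current, fight_duration):
    """E_wasted = 2 x supply voltage x wasted current x duration of wasting energy."""
    wasted_current = 0.5 * pmos_sat_current  # estimated as half the PMOS saturation current
    return 2.0 * v_dd * wasted_current * fight_duration


def wasted_fraction(useful_energy, wasted_energy):
    """Share of the total energy that is spent on the fight."""
    return wasted_energy / (useful_energy + wasted_energy)


v_dd = 3.0                 # volts, as in the XOR example
pmos_sat_current = 1.0e-3  # amperes, hypothetical
fight_duration = 400e-12   # seconds, hypothetical

e_wasted = wasted_energy_per_cycle(v_dd, pmos_sat_current, fight_duration)
e_useful = 3.6e-12         # joules, hypothetical useful component
print(f"E_wasted = {e_wasted * 1e12:.2f} pJ")
print(f"wasted share = {100 * wasted_fraction(e_useful, e_wasted):.0f}%")
```

The factor of 2 accounts for the two output transitions in one cycle, each of which involves a fight at one of the two outputs.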
7.2.1 DCVSL XOR 
This section applies the DCVSL delay and energy models to the XOR gate whose schematic 
is depicted in Figure 7.9. 
Figure 7.9: Schematic of DCVSL XOR gate. 
Table 7.2: Optimal transistor sizing for the DCVSL XOR gate.

| Applied Formula Set           | $W_p$ (µm) | $W_n$ (µm) | $W_p/W_n$ |
| Exact: 7.28 and 7.29          | 27.5       | 35         | 0.78      |
| Approximation: 7.30 and 7.31  | 41         | 35         | 1.17      |
| Rule of Thumb: 7.32 and 7.33  | 41         | 42         | 0.97      |
The specifications for the drives and the load are similar to those defined for the
conventional XOR gate, i.e. $W_{Dp}/W_{Dn} = 20/10$. However, the total 200 fF output capacitive load
is equally divided between the two outputs of the DCVSL XOR gate, i.e. $C_L$ = 100 fF on each
side of the gate. The optimization problem we want to address is also similar to the problem
statement made for the conventional XOR gate. In that case it was required to optimize
the delay over one cycle, i.e. the total or average delay. For a DCVSL gate, the delay over
one cycle is twice the rising delay. Hence, the optimal transistor sizing should target the
rising delay. Since input a is driving two transistors, whereas input b is driving only one,
the critical path is more likely to involve a rather than b. Therefore, the schematic identifies
m = 2, n = 2, and q = 2. Table 7.2 summarizes the results of the optimization using the
three formula sets: the exact set of formulas (7.28 and 7.29), the approximate set of formulas
(7.30 and 7.31), and finally the rules of thumb (7.32 and 7.33). HSPICE simulations show
that using the transistor sizing listed in Table 7.2, the rising delay is 381 ps for the exact
formula set, 420 ps for the approximated formula set, and 401 ps for the rules of thumb. The
energy dissipations for the three sizing sets are 12.3 pJ, 15.8 pJ, and 16.0 pJ, in the same
order. Therefore, there is a maximum difference of about 10% among the minimum delays
achieved by the three formula sets, with the lowest delay produced by the exact formulas.
This percentage difference is much more than the 2% for the conventional XOR gate. We
leave the reason for a later paragraph.
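
The "about 10%" figure follows directly from the three reported rising delays. The short check below uses only the delay and energy values quoted in this section; the variable names are illustrative.

```python
# Rising delays (ps) and energies (pJ) reported for the three DCVSL XOR sizing sets.
delays_ps = {"exact": 381.0, "approximate": 420.0, "rule_of_thumb": 401.0}
energies_pj = {"exact": 12.3, "approximate": 15.8, "rule_of_thumb": 16.0}

fastest = min(delays_ps, key=delays_ps.get)
slowest = max(delays_ps, key=delays_ps.get)
delay_spread_percent = 100.0 * (delays_ps[slowest] - delays_ps[fastest]) / delays_ps[fastest]
print(f"fastest: {fastest}, slowest: {slowest}, spread: {delay_spread_percent:.1f}%")

# Extra energy of each sizing set relative to the exact formulas.
for name, energy in energies_pj.items():
    extra_percent = 100.0 * (energy - energies_pj["exact"]) / energies_pj["exact"]
    print(f"{name}: {extra_percent:.1f}% more energy than the exact set")
```

On these numbers the exact formulas give both the lowest delay and the lowest energy of the three sets.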
Figure 7.10: Delay estimation and optimization for the DCVSL XOR gate using simulations
and the model. The ratio $W_p/W_n$ is kept constant to find the optimum $W_n$. (Curves:
simulated and modeled rising delay, simulated and modeled falling delay. Horizontal axis:
NMOS transistor width $W_n$ in µm.)
In order to confirm the results of the model with HSPICE simulations, we report two
sets of simulations. In the first set of simulations, as shown in Figure 7.10, $W_p/W_n$ is kept
constant at its optimal value 0.78 found by the model while $W_n$ is changed from 5 µm to
50 µm. These simulations confirm the value of 35 µm for the optimal $W_n$ obtained by the
model. There is also a good agreement between the rising and falling delays predicted by
the model and simulations. The rising delay has a broad minimum covering a wide range of
$W_n$ values. The value of $W_n$ at which the falling delay is minimum is far less than the value
required to minimize the rising delay.
Figure 7.11: Delay estimation and optimization for the conventional XOR gate using
simulations and the model. $W_n$ is kept constant to find the optimum $W_p/W_n$. (Curves:
simulation and model. Horizontal axis: PMOS to NMOS width ratio $W_p/W_n$.)
In the second set of HSPICE simulations, $W_n$ is kept constant at its optimal value while
changing $W_p/W_n$. As illustrated in Figure 7.11, simulations confirm the optimum $W_p/W_n$
determined by the model. It is noticeable that the delay does not exhibit such a broad
minimum with respect to $W_p/W_n$ as it does with respect to $W_n$. In fact, the delay of a
DCVSL gate is more sensitive to $W_p/W_n$ than that of a conventional CMOS
gate. For instance, the average delay of a CMOS inverter has a broad minimum between
$W_p/W_n$ = 1.3 and 3, approximately, whereas this range for the delay of the DCVSL XOR
is only between $W_p/W_n$ = 0.5 and 0.9, i.e. four times smaller. The higher significance of
$W_p/W_n$ in the case of a DCVSL gate is that this ratio, on the one hand, determines the fate
of the fighting during the output switching and, on the other hand, determines the delay. If
$W_p/W_n$ is made too large, the battle is lost, and if it is made too small the delay is poor.
Looking back at Table 7.2, we realize that the values of $W_n$ predicted by the three formula
sets are close to each other, but the same is not true for the values of $\Gamma$. This explains
the differences in the resultant minimum delays from the three sets of
formulas.
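
The "four times smaller" comparison can be reproduced from the two ratio ranges quoted above. The sketch below simply compares the widths of the two intervals; the variable names are illustrative.

```python
# Approximate W_p/W_n ranges over which the delay stays near its minimum (from the text).
inverter_range = (1.3, 3.0)    # conventional CMOS inverter, average delay
dcvsl_xor_range = (0.5, 0.9)   # DCVSL XOR, rising delay

inverter_span = inverter_range[1] - inverter_range[0]
dcvsl_span = dcvsl_xor_range[1] - dcvsl_xor_range[0]
span_ratio = inverter_span / dcvsl_span

print(f"inverter span = {inverter_span:.1f}, DCVSL XOR span = {dcvsl_span:.1f}")
print(f"the DCVSL XOR range is about {span_ratio:.2f} times narrower")
```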
Figure 7.12: Energy estimation for the conventional XOR gate using simulations and the
model. (Curves: HSPICE simulation; model including wasted energy; standard calculations.
Horizontal axis: NMOS transistor width $W_n$ in µm.)
Finally, Figure 7.12 shows the energy dissipation of the DCVSL XOR as a function
of $W_n$. The figure compares the results of HSPICE simulations with hand calculations. The
circles represent calculations including the wasted energy based on the model of expression
(7.35). The squares represent calculations neglecting the wasted energy. The former are in
good agreement with simulations. In this case, the energy wasted on fighting constitutes
around 25% of the total energy consumption. Note that the total energy includes the energy
dissipated by the output load and the diffusion capacitances of the drive besides the
energy dissipated by the cell. Hence, the wasted energy is a major part of the DCVSL cell's
energy consumption, which cannot be neglected.
7.3 PTL CMOS Styles 
By pass-transistor logic (PTL), we refer to the general category of CMOS circuits in which
signals, not necessarily $V_{DD}$ or ground, are passed from their inputs to their outputs through
a chain of MOS transistors. There are different styles of PTL circuits. Complementary
Pass-transistor Logic (CPL), introduced in [105], is the most widely used PTL style. Design issues
regarding CPL circuits are discussed in [69]. Double Pass-transistor Logic (DPL) is another
PTL style which, compared to CPL, uses double the number of transistors to achieve higher
performance [94]. According to our definition, the DILN C-element is a PTL circuit. PTL
usually uses fewer transistors than DCVSL and conventional CMOS styles for implementing
the same function.
Figure 7.13: A PTL CMOS cell between driving gates and output load.
Figure 7.13 illustrates the schematics of a PTL cell, its drives, and its load. A PMOS
latch is usually used at the output of PTL circuits to reduce short-circuit energy
dissipation. The latch transistors have the minimum PMOS width allowed by the technology, $w_p$.
Because of the small size of the latch and the fact that both sides of the latch switch almost
simultaneously, the latch does not interfere in the operation of the circuit. This is in contrast
to the case of the PMOS latch in a DCVSL circuit. The inputs to the transistors' sources
in CPL are only the variable signals and in DPL are combinations of the variable signals,
ground, and $V_{DD}$. In general, the drive for the gates of the transistors and the drive for the
sources of the transistors may be different, as shown in the figure. To distinguish between
the two, we refer to the former type of driving circuits as G-drives and refer to the latter
type of driving circuits as S-drives. If a drive is connected to both the gates and sources
of the transistors, for convenience, we still categorize it as an S-drive. The varieties in the
structure of PTL circuits create a number of situations which need to be considered for the
delay calculation and optimization. In PTL circuits the rising delay is the worst-case delay
and the subject of optimization, because NMOS transistors are slower in transferring a logic
1 than in transferring a logic 0. The critical path in PTL circuits, however, may extend from a
G-drive to the output or from an S-drive to the output. The critical path may also involve
both a G-drive and an S-drive, as the upcoming discussions clarify. Therefore, there are
three different cases, which are addressed in the following sub-sections.
7.3.1 Critical Path Involving G-Drive
In this case, the NMOS chain along the critical path producing the rising output is connected
to VDD and, by symmetry, the NMOS chain producing the falling output is connected to
ground. Although this scenario is rare in PTL circuits, it should be covered as a possibility
in the modeling of PTL circuits. This is also the least complicated case among the three. An
example of this scenario is the DILN C-element. The following arguments are with reference
to Figure 7.13, with the S-drive ignored.
The rising and falling delays as functions of the step delays are given by
where the step delays are calculated from these expressions.
Where CG includes all capacitances at the output of the G-drive, except the capacitances
related to the cell. The above results in the following expression for the optimum Wn for the
rising delay.
Where the approximation is almost always valid, because of the small size of the PMOS
latch. As before, the rule of thumb is obtained by replacing CL with a total loading
gate-oxide capacitance g WLt.
Inspection of (7.37) reveals that the theory of delay optimization stated for the
conventional style also applies to a PTL gate whose delay is governed by a G-drive. It does
not apply, however, to a PTL gate whose delay is governed by an S-drive or by both drives.
In fact, the theory applies to any CMOS topology that does not include overlapping and 
opposing currents. 
7.3.2 Critical Path Involving S-Drive 
This seems to be the dominant case in PTL circuits. Most CPL gates, like the OR/NOR gate
shown in Figure 7.14, resemble this structure. In this scenario, as depicted in Figure 7.14,
the NMOS chain along the critical path is connected to the output of another circuit, an
S-drive. The S-drive also controls the gates of m transistors of width Wn in the cell, where
m is greater than or equal to zero. If m = 0, then the assumption is that the transistors
along the chain have already been turned on by a G-drive. There are qo diffusions of width
Wn at the output and qi such diffusions at the input, as shown in the figure.
Figure 7.14: A PTL CMOS cell connected to an S-drive along the critical path. A CPL
OR/NOR gate is also shown as an example.
The rising delay expression is as follows.
Where the first term represents the step rising delay of the cell. In this term, the parentheses
enclose the total resistance of the critical path, which is the sum of the S-drive's pull-up
resistance and the resistance of the NMOS chain. A similar expression governs the falling
delay. The S-drive step rising delay is given by
where CS includes all capacitances at the output of the S-drive, except the capacitances
related to the cell. The optimal transistor sizing for the rising delay can be calculated using
the following formula
which leads to this rule of thumb.
7.3.3 Critical Path Involving Both G-Drive and S-Drive 
In this case, as illustrated in Figure 7.15, the assumption is that the input signal, which is
the output of an S-drive, has already arrived at the source of the NMOS transistor chain.
Once the chain is turned on by the G-drive, the output of the S-drive is transferred to the
output of the cell. Therefore, both drives influence the delay. The CPL XOR/XNOR gate
shown in Figure 7.15 is an example of a circuit subject to the scenario discussed here.
Figure 7.15: A PTL CMOS cell connected to an S-drive along the critical path controlled
by a G-drive. A CPL XOR/XNOR gate is also shown as an example.
Because of the complications involved in the delay modeling for this case, we need to
use some approximations, which we shall justify using an RC network model. Consider the
RC network of Figure 7.16. Initially, C1 is fully charged, but C2 has no charge stored.
We are interested in estimating the delay from the time the switch is closed to the time
when the potential across C2 reaches half of the value of the supply voltage, i.e. until C2 is
half charged. This network with its initial condition can be easily solved by the standard
methods of circuit theory. The result, however, is complicated and involves many terms.
Using Elmore's theory [35], if C1 had no initial charge, the delay would be
Since C1 is already charged, the term C1R1 should be dropped. However, by doing so, the
influence of C1 on the delay is totally ignored, which is a good approximation if C1 is much
smaller than C2, but may result in large errors otherwise. On the other hand, if C1 is
much larger than C2, the delay does not depend on R1, because C1 supplies C2 with all the
necessary charge. Hence, the following approximation of the delay is valid, at least, in these
two extreme cases.
Where the factor 2 appears in the denominator to account for the fact that C1 is fully
charged, as opposed to the half-charged state of C2 we are seeking for the delay estimation.
Figure 7.16 shows how this approximation compares to the simulation results for a range of
C2/C1. As expected, the approximation works well when C2 is much larger or much smaller
than C1. In between these two cases, the accuracy of the approximation depends on the
relative values of R1 and R2. The error seems to be larger when R2 > R1, but this is very
unlikely in our application.
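The two extreme cases discussed above can be checked numerically. The following is a minimal sketch, not the closed-form approximation of this section: it assumes the network is a supply feeding node 1 (with C1 to ground) through R1, and node 2 (with C2 to ground) connected to node 1 through R2 once the switch closes. The function name, the normalized component values, and the step-size choice are illustrative assumptions.

```python
import math


def half_charge_delay(r1, r2, c1, c2, vdd=1.0, steps_per_tau=200):
    """Time for v2 to rise from 0 to vdd/2, with C1 precharged to vdd.

    Assumed topology: vdd --R1-- node1 (C1 to ground) --R2-- node2 (C2 to ground).
    """
    def deriv(v1, v2):
        i1 = (vdd - v1) / r1   # current from the supply into node 1
        i2 = (v1 - v2) / r2    # current from node 1 into node 2
        return (i1 - i2) / c1, i2 / c2

    # Step well below the smallest RC product so the fast pole is resolved.
    dt = min(r1 * c1, r2 * c1, r2 * c2) / steps_per_tau
    v1, v2, t = vdd, 0.0, 0.0
    while True:
        # One classical fourth-order Runge-Kutta step.
        a1, a2 = deriv(v1, v2)
        b1, b2 = deriv(v1 + 0.5 * dt * a1, v2 + 0.5 * dt * a2)
        e1, e2 = deriv(v1 + 0.5 * dt * b1, v2 + 0.5 * dt * b2)
        f1, f2 = deriv(v1 + dt * e1, v2 + dt * e2)
        n1 = v1 + dt * (a1 + 2 * b1 + 2 * e1 + f1) / 6
        n2 = v2 + dt * (a2 + 2 * b2 + 2 * e2 + f2) / 6
        if n2 >= vdd / 2:
            # Linear interpolation inside the step that crosses vdd/2.
            return t + dt * (vdd / 2 - v2) / (n2 - v2)
        v1, v2, t = n1, n2, t + dt


ln2 = math.log(2)
# C1 much smaller than C2: compare with the single-pole estimate ln2*(R1+R2)*C2.
ratio_small_c1 = half_charge_delay(1.0, 1.0, 0.01, 1.0) / (ln2 * 2.0)
# C1 much larger than C2: compare with ln2*R2*C2, which is independent of R1.
ratio_large_c1 = half_charge_delay(1.0, 1.0, 100.0, 1.0) / (ln2 * 1.0)
print(ratio_small_c1, ratio_large_c1)
```

Under these assumptions both ratios should be close to one, which mirrors the argument that R1 matters when C1 is negligible and drops out when C1 can supply all the charge C2 needs.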
Figure 7.16: An RC network for modeling the delay in PTL circuits involving both a G-drive
and an S-drive.
Turning back to the PTL circuit of Figure 7.15, C1 resembles the total capacitance at the
input of the cell CI, i.e. the output of the S-drive, and C2 resembles the total output
capacitance of the cell CO.
Using the RC network approximation yields the following expressions for the rising and
falling delays.
Assuming CO is much larger than CI results in
and, consequently, the following rule of thumb for optimal transistor sizing.
If CI is much larger than CO, then the S-drive may be ignored, as if the source of the NMOS
chain is connected to VDD. That is, the scenario simplifies to the first case discussed in
Section 7.3.1, and the optimal sizing formula becomes
7.3.4 CPL XOR 
Figure 7.17: Schematic of CPL XOR gate. 
The CPL implementation of the XOR gate is depicted in Figure 7.17. As mentioned before,
the worst-case delay is more likely to involve both the driving gate for b and the driving gate
for a. The sizes of the driving inverters and the output load are similar to those previously
defined for the conventional and DCVSL XOR gates, i.e. WDp/WDn = 20/10 and CL = 100 fF
on each side. The schematic reveals that m = 2, n = 1, and qo = 2. Applying (7.47) results
in Wn = 14.70 µm, whereas (7.48) yields Wn = 16.40 µm. Both results are acceptable and
within the range of Wn producing a minimum delay, as confirmed by HSPICE simulation
results shown in Figure 7.18. Figure 7.18 also demonstrates the good agreement between
the model and the simulations in predicting the rising and falling delays of the CPL XOR
gate. Note that the CPL XOR gate involving both drives, as examined here, represents the
most complicated scenario among the three cases presented for the delay estimation in a
PTL gate. As explained, we have incorporated some additional approximations in the delay
modeling for this particular case and, naturally, expect some degree of disagreement with
HSPICE results.
[Figure 7.18: rising and falling delays of the CPL XOR gate from simulation and from the
model, plotted against the NMOS transistor width Wn (µm).]
7.4 Comparing CMOS Implementations of XOR
Being able to optimize the performance of the gates implemented in various logic styles, we
are now ready to perform a fair comparison between the CMOS implementations of the XOR
gate. When the gates are implemented in different logic styles, there is a high possibility
that their performances are not equally affected by the supply voltage scaling. Hence, for the
XOR gate, we are adding another dimension to our comparison technique developed for the
case of the C-element. That is, the performance and the energy of the gates are evaluated at
different supply voltages. At each value of the supply voltage, the gates are sized to deliver
the minimum worst-case delay over one switching cycle. The optimization is performed using
the formulas derived in this chapter.
Figure 7.19 shows the behaviour of the minimum delays and their corresponding energies
for the conventional, DCVSL, and CPL XOR gates at different values of the supply voltage.
The results are obtained by the model and HSPICE simulations. Very good agreement is
observed between the two for the delays and energies of all three styles. The outcome of the
comparison is very much in favour of the CPL implementation in terms of delay and energy.
However, it is noticeable that the performance of the CPL XOR gate degrades faster than that
of the conventional and DCVSL XOR gates as the supply voltage is scaled down. For instance,
at VDD = 1.5 V, the conventional and DCVSL XOR gates both outperform the CPL XOR gate.
In this study, we have not considered the behaviour of the gates in relation to the difference
in the arrival times of the complementary input signals and the possible effects of the charge
sharing phenomenon.
7.5 Concluding Remarks 
In this chapter, we have derived macromodels for the main CMOS logic styles by applying
the delay model developed in the previous chapter. The CMOS logic styles studied include
conventional, DCVSL, and PTL. Based on the macromodels, we have derived exact and
approximate closed-form formulas for the optimal transistor sizing of the various CMOS
logic styles in terms of the driving gates and the loads. In addition, further approximations
were applied to come up with a set of simple formulas to be used as rules of thumb for the
optimization of each CMOS logic style. The macromodels and the optimal transistor sizing
formulas were compared with HSPICE simulations for different CMOS implementations of
the XOR gate. Finally, the optimally-sized XOR gates were compared in terms of the delay
and energy at various values of the supply voltage.
Figure 7.19: Delay and energy dissipation versus VDD for the optimized CMOS
implementations of the conventional (standard), DCVSL, and PTL (CPL) XOR gates. The drive's
WDp/WDn = 20/10 and total CL = 200 fF.
Chapter 8 
Delay Estimation and Optimization at 
the Module Level 
This chapter demonstrates how our findings in the previous chapters can be used to estimate
and optimize the delay in a structure consisting of a number of logic gates. Such a structure
is usually called a functional block or a module. One section in this chapter is devoted to
delay estimation and optimization of conventional CMOS logic circuits. Another section is
devoted to delay estimation and optimization of mixed logic style CMOS circuits. Finally,
the chapter concludes with some remarks. The following two paragraphs present a quick
summary of the previous two chapters and show how they are related to the content of this
chapter.
At the device level, we presented a unified model for the average current in an MOS device
that covers a logic 1 transfer and a logic 0 transfer. Using this model at the switch level,
we formulated the PMOS rising delay, NMOS falling delay, PMOS falling delay, and NMOS
rising delay. The switch-level delay models accommodate the signal slope effect and the
short-channel behaviour of series-connected MOS transistors. At the logic level, we applied
the switch-level delay model to derive delay macromodels for different CMOS logic styles
including conventional, DCVSL, and PTL. The macromodel for a logic gate is expressed in
terms of the size and the topology of the gate itself, the size and the topology of the driving
gate, the size of the loading gates, and any spurious capacitances. In the delay and energy
macromodels for a DCVSL gate, we included the effect of the race between the PMOS latch
CHAPTER 8. MODULE LEVEL DELAY ESTIMATION AND OPTlMlZATlON 155 
and the NMOS network during output switching. We also derived three macromodels for
a general PTL cell based on whether the critical path involves a gate drive, a source drive,
or both. At the module level, to estimate the delay of a path containing mixed logic styles,
we simply add the delays of the gates along the path using the developed logic-level delay
macromodels.
As far as delay optimization is concerned, at the logic level, we derived exact and
approximate optimal transistor sizing formulas for minimizing the delay of different CMOS logic
styles. We also developed a theory for delay optimization of conventional CMOS logic gates.
At the module level, we use these findings to address the problem of delay optimization.
Given a circuit with an arbitrary number of logic gates, it is possible to use the formulation
presented in the previous chapter to obtain the worst-case delay in the circuit and minimize
it by iteration. The worst-case delay is the maximum delay over all of the paths in the
circuit extending from the input to the output. Sometimes it is possible to identify the critical
path in a circuit by inspection. If in doubt, the paths' delays may be re-estimated after the
optimization process.
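The module-level estimation rule described above, adding the gate delays along a path and taking the maximum over all paths, can be sketched as follows. This is only an illustration: the gate names, the path list, and the delay values are hypothetical, and each gate is given a single fixed delay, whereas the macromodels make a gate's delay depend on its driving gate and its load.

```python
def path_delay(path, gate_delay):
    """Delay of one path: the sum of the delays of the gates along it."""
    return sum(gate_delay[gate] for gate in path)


def worst_case(paths, gate_delay):
    """Return the critical path and its delay (the maximum over all paths)."""
    critical = max(paths, key=lambda p: path_delay(p, gate_delay))
    return critical, path_delay(critical, gate_delay)


# Hypothetical per-gate delays in picoseconds.
gate_delay = {"G1": 120.0, "G2": 95.0, "G3": 140.0, "G4": 80.0}
# Hypothetical input-to-output paths through the module.
paths = [("G1", "G2", "G4"), ("G1", "G3"), ("G3", "G4")]

critical, delay = worst_case(paths, gate_delay)
print(critical, delay)
```

Re-running such a check after sizing is one way to confirm that the path assumed to be critical during optimization is still the slowest one.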
8.1 Conventional CMOS Logic Circuits
The previous chapter showed that the size of a transistor in a conventional CMOS gate is
optimal if the delay due to that transistor (and any other related transistor due to symmetry)
as a load equals the delay through that transistor as a drive. We mentioned that, since a
CMOS gate may have more than one branch in its pull-up and its pull-down networks, any
possible symmetry between the sizes of the transistors in these branches should be considered.
For this purpose, with reference to Figure 7.1, we denoted the number of transistor gates
related by symmetry with m and the number of transistor diffusions related by symmetry
with q. The symmetry is often important in designing cell libraries and optimizing the delay
of individual gates for a given input drive and an output load.
In a circuit consisting of a number of conventional CMOS gates, however, usually only
one of the pull-up branches and one of the pull-down branches in a gate determine the
worst-case delay of that gate along the critical path of the circuit. Therefore, observing symmetry
in optimizing the gates is not necessary. We call the combination of such a pull-up branch
and such a pull-down branch within a gate a stage. For a stage, usually m = 1 and q = 1.
In a module or a path, we need to optimize these stages along the path. Hence, we may
state that the delay of a circuit consisting of conventional CMOS logic gates is optimal if
for each stage along the critical path of the circuit the following holds: the delay due to that
stage as a load equals the delay through that stage as a drive. According to the derivation of
Section 7.1, spurious capacitances along the path, signal slope factor, and series-connected
transistors do not void this theory of delay optimization in conventional CMOS logic circuits.
Figure 8.1: Three stages along the critical path of a conventional CMOS logic circuit.
Figure 8.1 illustrates three stages along the critical path of a CMOS logic circuit
consisting of conventional gates. Using the stated theory, for optimizing the rising delay of
stage i we have
where the left hand side represents the delay of the stage as a load and the right hand
side represents the delay through the stage as a drive. Note that the delay slope factor of
stage i must be accounted for if i is not the last stage to be optimized. This is indicated
by <1 + Sn> in the above equation. Also note that, in order to express the complete
delays, we should add the following delay term due to the internal capacitances to both sides
of (8.1).
This term, however, cancels out from both sides. The reason for adding this term on both
sides is that it is a delay term in which the pull-up network appears as both a load and a drive.
For quick checking of optimization results, a designer may use the following approximation
of (8.1).
Where the signal slope factors have been ignored, the series-connection factor Y has been
replaced by n, the PMOS gate capacitances per unit width for rising and falling transitions
have been replaced by the average gate capacitance per unit width g, the NMOS
diffusion capacitance per unit width for the rising delay has been replaced by the average
diffusion capacitance d and, similarly, the rising delay load has been replaced by the average load.
As far as the total or average delay is concerned, according to the theory, the following
set of equations must hold for each stage i on the path.
Similarly, the following must hold for optimizing the falling delay of stage i
Consider a critical path with N logic stages. For optimizing the total delay, there are 2N
equations and 2N unknowns. This system of non-linear equations may be solved by iteration
with the minimum allowable transistor width w as the initial value for all transistors.
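The iterative solution described above can be illustrated on a deliberately simplified problem. The sketch below does not use equations (8.1) to (8.3); it assumes a first-order RC model in which stage i has drive resistance r/W_i and input capacitance g*W_i, so that the balance condition "delay as a load equals delay as a drive" becomes (r/W_{i-1})*g*W_i = (r/W_i)*(C_i + g*W_{i+1}). The parameter names r, g, caps, and w_min, as well as all numerical values, are illustrative assumptions.

```python
import math


def optimize_chain(w_drive, caps, g, w_min, tol=1e-9, max_iter=10000):
    """Size a chain of stages by sweeping the balance condition to convergence.

    caps[i] is the fixed capacitance at the output of stage i (the last entry
    includes the external load). All widths start at the minimum width w_min.
    """
    n = len(caps)
    w = [w_min] * n
    for _ in range(max_iter):
        change = 0.0
        for i in range(n):
            w_prev = w_drive if i == 0 else w[i - 1]
            w_next = w[i + 1] if i + 1 < n else 0.0
            # Solve the balance condition of stage i for its own width.
            new = max(w_min, math.sqrt(w_prev * (caps[i] / g + w_next)))
            change = max(change, abs(new - w[i]))
            w[i] = new
        if change < tol:
            break
    return w


def load_and_drive_delay(w, i, w_drive, caps, g, r):
    """Delay caused by stage i as a load, and delay through it as a drive."""
    w_prev = w_drive if i == 0 else w[i - 1]
    w_next = w[i + 1] if i + 1 < len(w) else 0.0
    as_load = (r / w_prev) * g * w[i]
    as_drive = (r / w[i]) * (caps[i] + g * w_next)
    return as_load, as_drive


# Hypothetical values: widths in um, capacitances in fF, g in fF/um, r in kOhm*um.
caps = [150.0, 100.0, 200.0, 250.0]
widths = optimize_chain(w_drive=10.0, caps=caps, g=2.0, w_min=1.0)
for i in range(len(widths)):
    print(i + 1, widths[i], load_and_drive_delay(widths, i, 10.0, caps, 2.0, 10.0))
```

In this toy model the self-loading of a stage by its own diffusion capacitance contributes a width-independent delay, so it is left out of the balance, in the same spirit as the internal-capacitance term that cancels from both sides in the text.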
Figure 8.2: An example of a critical path in a conventional CMOS logic circuit.
For example, the critical path of Figure 8.2 includes four gates G1 through G4. The
driving gate G0 has WDp = 20 µm and WDn = 10 µm. The spurious load capacitances are
C0 = 300 fF, C1 = 150 fF, C2 = 100 fF, C3 = 200 fF, and C4 = 250 fF. Using (8.3) gives
the results listed in Table 8.1 for optimizing the total delay in the path. With a circuit of
this size, it is almost impossible to verify the optimization results by a circuit level simulator
such as HSPICE.
Table 8.1: Optimal transistor sizing for the critical path of Figure 8.2. 
Although the conditions for optimizing the rising, falling, and total delays in conventional
CMOS logic circuits may be turned into closed-form formulas, similar to those of the previous
chapter, the message we want to convey in this section is that one may easily derive these
formulas by knowing the developed theory of delay optimization. This is important for a
quick hand analysis of the optimization results. Let us check the results of Table 8.1 by
examining, for example, Wp2. From the theory, we know that the total loading effect of
Wp2 on the delay must equal its driving effect on the delay. Therefore, the following must
approximately hold for Wp2.
Where factors 2, 1, 1, and 4 are the numbers of series-connected transistors in the NMOS
network of G1, PMOS network of G1, NMOS network of G2, and PMOS network of G2,
respectively. Substituting the values for the parameters yields
The left hand side evaluates to 151 ps and the right hand side evaluates to 132 ps. Considering
that we have ignored a number of effects, the difference of about 10% between the two sides
is quite acceptable. On the other hand, if we calculate Wp2 from
the result is 169 µm, which is very close to the value listed in Table 8.1. Such a quick check
gives the designer some degree of assurance that the optimal transistor sizes produced by
the employed CAD tool are indeed valid.
8.2 Mixed Logic Style CMOS Circuits 
This section discusses optimization of digital CMOS circuits involving mixed logic styles.
Table 8.2 summarizes the optimal transistor sizing formulas for different CMOS logic styles.
The formulas are similar to those of Chapter 7, except for the terms enclosed within "< >".
Such a term accounts for the signal slope effect of the stage being optimized on the next
stage. Therefore, it should not be included if the stage being optimized is the last stage on
the path. In this regard, the formulas given in Chapter 7 as rules of thumb remain unchanged
for optimizing a path, except for DCVSL, which changes from (3/4n)A to (4/5n)A. As
Table 8.2: Optimal transistor sizing of CMOS logic styles. Terms enclosed within "< >"
should not be included if the stage being optimized is the last stage. Notation: subscripts
n (NMOS), p (PMOS), D, G, S (drives), and L (load); accents mark rising and falling
transitions; A = Y,/ V,. The parameters are defined in Chapter 6 and Chapter 7
with reference to Figure 7.1 for conventional CMOS, Figure 7.8 for DCVSL, and Figure 7.13
for PTL.
CONVENTIONAL CMOS LOGIC STYLE: rising delay; falling delay; average or total delay
DCVSL and DIL STYLES
PTL STYLES (CPL, DPL, DILN, DILP)
we mentioned in the previous section, the symmetry related parameters m and q for the
conventional logic stages usually equal unity. This, however, is not true for the differential
logic styles. Due to the inherent symmetrical structures of DCVSL and PTL styles, the
symmetry related parameters are often larger than unity for these logic styles. The reader
is reminded that, for DIL implementations, one should include w( $ + A) in & and
w( &, + ij,) in cL in the corresponding formulas listed in Table 8.2. This is to account for
the NMOS transistors of the inverter latch, which are not active during switching. We should
also remind the reader that the parameters used in the formulas of Table 8.2 are defined in
Chapter 6 and Chapter 7 with reference to Figure 7.1 for the conventional CMOS logic style,
Figure 7.8 for DCVSL, and Figure 7.13 for PTL.
We have implemented the optimal transistor sizing formulas of Table 8.2 as functions in
a programming library. Each set of formulas corresponding to a logic style is represented by
a function of the following pattern.
Where LogicStyle is conve for a conventional CMOS implementation, dcvsl for a DCVSL
implementation, ptlgd for a PTL implementation whose worst-case delay is governed by a
G-drive, ptlsd for a PTL implementation whose worst-case delay is governed by an S-drive,
and ptlbd for a PTL implementation whose worst-case delay is governed by both a G-drive
and an S-drive. Note that VDD has to be specified, because some of the parameters, such
as v, Y, S, and the capacitances, are supply voltage dependent. Separate functions in the
library calculate each of these parameters for NMOS and PMOS transistors and, for each
transistor type, for rising and falling transitions. The calculations are based on the models
of Chapter 6. In addition to the optimization functions, the library includes functions for
estimating the delays of the CMOS logic styles. These functions are based on the delay
macromodels developed in Chapter 7. The rest of this section presents an example for delay
optimization and estimation in a CMOS circuit including mixed logic styles.
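One possible way to organize such a library is a small dispatcher keyed by the style identifiers named above. The sketch below is purely illustrative and is not the interface of the library described here: only the five identifiers (conve, dcvsl, ptlgd, ptlsd, ptlbd) and the requirement that VDD be supplied come from the text, while the function names, the keyword parameters, and the toy sizing rule are hypothetical.

```python
import math
from typing import Callable, Dict

# Style identifiers quoted from the text; the descriptions paraphrase it.
LOGIC_STYLES: Dict[str, str] = {
    "conve": "conventional CMOS",
    "dcvsl": "DCVSL",
    "ptlgd": "PTL, worst-case delay governed by a G-drive",
    "ptlsd": "PTL, worst-case delay governed by an S-drive",
    "ptlbd": "PTL, worst-case delay governed by both drives",
}

SizingRule = Callable[..., float]
_RULES: Dict[str, SizingRule] = {}


def register(style: str, rule: SizingRule) -> None:
    """Attach a sizing rule (hypothetical signature) to a known logic style."""
    if style not in LOGIC_STYLES:
        raise ValueError(f"unknown logic style: {style!r}")
    _RULES[style] = rule


def optimal_width(style: str, vdd: float, **params: float) -> float:
    """Dispatch to the rule of the requested style; VDD is always required."""
    if style not in LOGIC_STYLES:
        raise ValueError(f"unknown logic style: {style!r}")
    if vdd <= 0:
        raise ValueError("VDD must be positive")
    if style not in _RULES:
        raise NotImplementedError(f"no sizing rule registered for {style!r}")
    return _RULES[style](vdd=vdd, **params)


def toy_conventional_rule(vdd: float, w_drive: float, c_load: float, g: float) -> float:
    """Placeholder rule from a first-order RC model; it ignores vdd on purpose."""
    return math.sqrt(w_drive * c_load / g)


register("conve", toy_conventional_rule)
width = optimal_width("conve", vdd=3.3, w_drive=10.0, c_load=250.0, g=2.0)
print(width)
```

Keeping VDD as an explicit argument of the dispatcher reflects the remark that the model parameters are supply voltage dependent, even though the placeholder rule used here does not depend on it.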
Chapter 2 explained that, in asynchronous circuits, there are two common signaling
protocols for communicating data between a sender and a receiver: the four-phase and
the two-phase protocol. If the sender is using the four-phase protocol and the receiver is
using the two-phase protocol, then a module called a four-to-two phase converter provides the
interface between the sender and the receiver. A delay-insensitive implementation of the
four-to-two phase converter is depicted in Figure 8.3 [34]. This module uses a TOGGLE, a
JOIN (C-element), and a MERGE (XOR gate). In the figure, rf denotes the request signal
from the four-phase sender, af denotes the acknowledgment signal to the four-phase sender,
rt denotes the request signal to the two-phase receiver, and at denotes the acknowledgment
signal from the two-phase receiver. An operation cycle of the four-to-two phase converter
starts with a transition on rf and ends with a transition on af. In between these two
transitions, there are two sets of events that may take place in parallel. One set of events
consists of a transition on af followed by a transition on rf. The other set of events consists
of a transition on rt followed by a transition on at.
Figure 8.3: A four-to-two phase converter [34].
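The ordering constraints of one operation cycle, as described above, can be written down as a small trace checker. This is a sketch for illustration only: the event labels rf1, af1, rf2, and af2 (first and second transitions on rf and af) are hypothetical names introduced here, and the checker encodes nothing beyond the orderings stated in the text.

```python
from itertools import permutations

# One operation cycle: two transitions each on rf and af, one each on rt and at.
EVENTS = ("rf1", "af1", "rf2", "rt", "at", "af2")


def is_valid_cycle(trace):
    """Check a sequence of events against the ordering described in the text."""
    if sorted(trace) != sorted(EVENTS):
        return False
    pos = {event: index for index, event in enumerate(trace)}
    return (
        pos["rf1"] == 0                   # the cycle starts with a transition on rf
        and pos["af2"] == len(trace) - 1  # and ends with a transition on af
        and pos["af1"] < pos["rf2"]       # four-phase side: af, then rf
        and pos["rt"] < pos["at"]         # two-phase side: rt, then at
    )


# The two middle sets of events are independent, so several interleavings are legal.
valid_traces = [t for t in permutations(EVENTS) if is_valid_cycle(t)]
for trace in valid_traces:
    print(" -> ".join(trace))
```

Because the four-phase handshake and the two-phase handshake in the middle of the cycle are unordered with respect to each other, the enumeration yields every interleaving of the two two-event chains.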
Assume that a designer has decided to implement this four-to-two phase module with a
conventional CMOS TOGGLE, DIL C-element, and CPL XOR gate. The DIL C-element and
CPL XOR gate were introduced in the previous chapters. The behaviors of the outputs of
the TOGGLE, b and c, can be expressed in terms of the input a and the previous states of
the outputs by the following Boolean functions.
b =  a d  + a'b 
c =  a'b + a ê  
A conventional implementation of the TOGGLE derived from these expressions is illustrated
in Figure 8.4. A similar implementation with an additional initialization signal is presented
in [37] under the name Yantchev TOGGLE. In the TOGGLE implementation of Figure 8.4,
some of the transistors are not active during output switching and are only dedicated to
maintaining the state of an output. These transistors are assigned the minimum width w.
Figure 8.4: Schematic of a conventional implementation of the TOGGLE.
Figure 8.5 depicts the transistor-level schematic of the four-to-two phase converter. Each
input and each output of the module is connected to a buffer with a PMOS transistor of
width Wpb = 15 µm and an NMOS transistor of width Wnb = 10 µm. The transistor sizing
parameters to be determined for the TOGGLE include Wp1, Wn1, Wp2, Wn2, Wp3, Wn3, Wp4,
and Wn4. Similarly, the transistor sizing parameters to be determined for the C-element
include Wp5 and Wn5. The XOR gate has only one transistor sizing parameter, Wn6. The
rest of the transistors in the circuit have the minimum width w, as indicated. One way
of optimizing this circuit is to target minimal delay in conveying the request signal to the
receiver, i.e. producing rt, and minimal delay in producing the last event of an operation
cycle of the module, i.e. producing the second transition on af. This requires calling the
following functions, which are evaluated in a number of iterations.
The reader may verify our procedure by inspecting Figure 8.5 and the general pattern for
the optimization functions given in (8.5). Table 8.3 lists the results of the optimization
process. To evaluate the delay along a path, we add the delays of the logic stages along
that path. Using the transistor sizings suggested in Table 8.3, the estimated average delay
between receiving the first transition on rf and producing the following transition on rt is
210 ps, and the estimated average delay between receiving the second transition on rf and
producing the following transition on af is 750 ps.
Table 8.3: Optimal transistor sizing for the four-to-two phase converter of Figure 8.5.
8.3 Concluding Remarks 
This chapter has demonstrated a method of delay optimization in conventional and mixed
logic CMOS circuits by using the theory of delay optimization for conventional CMOS
circuits and the logic-level delay macromodels developed in the previous chapter. The delay
macromodels may also be used to optimize a circuit under multiple constraints of delay,
energy, area, and supply voltage. This issue, however, is not part of this thesis. Another
possible work is delay optimization at the system level. System level optimization requires
a separate study, because optimizing the delay along a path may actually increase the delay
of another path which, in turn, may increase the overall worst-case delay of the system.
Figure 8.5: Transistor-level schematic of an asynchronous four-to-two phase converter.
Chapter 9 
Conclusion 
In this thesis, we have presented a technique for modeling, evaluation, and optimization of
digital CMOS circuits. Based on this technique, we have derived delay models and
closed-form optimal transistor sizing formulas for several conventional and differential CMOS logic
styles. The scope of our technique covers the device, switch, logic, and module levels of
abstraction. The technique evolves from the idea of explicit formulation of the delay of a
logic gate in terms of its own size and the sizes of its driving and loading gates. We have
applied the developed delay models and optimization formulas to a number of asynchronous
circuit primitives and modules.
One of the contributions of this thesis is the theory of delay optimization in CMOS logic
circuits. The theory states that the delay in a circuit consisting of conventional CMOS
logic gates is minimal if, for each stage along the critical path of the circuit, the delay due
to that stage (as a load) equals the delay through that stage (as a drive). In other words,
if the loading effect of each stage equals its driving effect, then the delay along the path
is minimal. This theory also generally holds for logic gates which experience no cases of
overlapping and opposing currents. We have shown that the theory is valid even when the
input slope factor and the effect of serial connection of transistors are taken into account.
Moreover, presence of branches and spurious capacitances along the path does not void the
theory. This theory simplifies the process of delay optimization into expressing and solving
a system of non-linear equations by iteration. The theory is also convenient for checking the
optimal transistor sizing results of a CAD tool.
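The balance condition can be illustrated on the simplest case, a chain of inverters. The
following Python sketch is not the delay model of this thesis: it assumes a normalized stage
delay equal to the ratio of the load width to the driver width, and it ignores self-loading,
input slope, and interconnect. The names path_delay and size_chain are hypothetical. Under
this assumption, equating the delay a stage causes as a load with the delay through it as a
drive gives W[i] = sqrt(W[i-1] * W[i+1]), which the code solves by iteration.

```python
import math


def path_delay(widths):
    """Total normalized delay: each stage contributes load width / driver width."""
    return sum(widths[i + 1] / widths[i] for i in range(len(widths) - 1))


def size_chain(w_driver, w_load, n_interior, sweeps=500):
    """Size the interior stages of an inverter chain by fixed-point iteration.

    The first and last widths are fixed by the environment. For an interior
    stage i, the delay it causes as a load is W[i] / W[i-1] and the delay
    through it as a drive is W[i+1] / W[i]; setting the two equal gives the
    geometric-mean update used below (assumed first-order model).
    """
    widths = [w_driver] + [w_driver] * n_interior + [w_load]
    for _ in range(sweeps):
        for i in range(1, len(widths) - 1):
            widths[i] = math.sqrt(widths[i - 1] * widths[i + 1])
    return widths


if __name__ == "__main__":
    chain = size_chain(1.0, 81.0, 3)
    print([round(w, 3) for w in chain])
    print(round(path_delay(chain), 3))
```

With a driver of width 1, a load of width 81, and three interior stages, the iteration
settles on a constant taper factor of 3 between consecutive stages, which matches the
observation that constant tapering follows from the balance condition in this simple model.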
Another contribution of this thesis is the derivation of the optimal transistor sizing
formulas for both conventional and unconventional CMOS logic styles. These optimal transistor
sizing formulas enable optimization of mixed logic-style CMOS circuits. The logic styles
covered include conventional, DCVSL, and PTL. We have derived three optimal sizing formulas
for PTL style, based on the position of the driving gate or gates. We have demonstrated
optimal transistor sizing of a mixed logic style module by using these formulas.
A third contribution of this thesis which, in fact, has led to the previous two contributions,
is the development of a unified delay model for CMOS logic styles. The delay model includes
an expression for the saturation current of short-channel MOS transistors that is equally valid
for rising and falling transitions. The model captures the effect of input signal slope and
characterizes the behaviour of MOS transistors connected in series. This model shows that
the internal diffusion capacitances of a chain of similarly-sized series-connected transistors
do not affect optimal sizing of the chain.
These three contributions are likely to have an impact on the development of logic simulators
and optimization tools that support mixed CMOS logic styles in the future. The rest of this
chapter reviews the results of this thesis and highlights directions for future work.
9.1 Review 
The following is a review of the material presented in this thesis divided into subsections on
modeling, optimization, and applications. This summary is intended to be self-contained.
9.1.1 Delay Modeling 
The scope of the delay modeling technique presented in this thesis covers a number of
abstraction levels. At each level, the reliability of the model has been verified by HSPICE
simulations. In general, the proposed model exhibits very good agreement with HSPICE
simulations. We have followed a bottom-up approach in delay modeling.
Starting at the device level, we have proposed the following expression for evaluating the
saturation current of submicron MOS devices.
Where W is the effective width of the device, parameter K depends on the technology and the
effective channel length, index ς captures the velocity saturation effect, and index θ captures
the mobility degradation effect. This model is considerably more accurate than the popular
α-power law, which replaces the voltage-dependent exponent built from ς and θ with a constant
index α. With proper parameters, the above expression is also used to represent the saturation
current in an NMOS device transferring a logic 1, rather than 0, and in a PMOS device
transferring a logic 0, rather than 1. Overall, four sets of parameters K, ς, and θ are required
to characterize the four types of currents. Procedures for extracting these parameters through
circuit simulations have been provided.
At the switch level, we recognize four types of delays: PMOS rising delay, NMOS falling
delay, NMOS rising delay, and PMOS falling delay. These delays are derived using the
corresponding four types of currents. To capture the effect of input signal slope, we have
extended a previously reported technique to the four delay types. This technique adds a
fraction of the input transition time to the step delay. Moreover, we have offered a
semi-empirical model that characterizes the delay behaviour through a chain of MOS transistors
connected in series. According to our model, the general expression for the delay through N
similarly-sized transistors connected in series is given by
Where D_0 is the step delay of the driving gate, S is the input slope factor, and X and Y
are the degradation factors related to N. Parameter ν = W V_DD/(2 I_D). The remaining
quantities are the total gate-oxide width of the load, the total diffusion width of the load,
the gate-oxide capacitance per unit width g, the diffusion capacitance per unit width d, and
the load capacitance due to interconnects C_f. Note that different X, Y, and S are defined for
each of the four delay types. Also, g and d are different for NMOS and PMOS transistors as well
as for the rising and falling transitions. We have included expressions for X, Y, and S and
methodologies for extracting d and g through circuit simulations. From the above expression
for the delay, it is clear that neglecting the diffusion capacitances does not affect optimal
transistor sizing. Other topics studied at this level include the influence of overlapping and
opposing currents on the delay. We have suggested intuitive expressions to cover these two
cases.
At the logic level, we have applied the switch-level delay model to formulate delay
macromodels for different CMOS logic styles including conventional, DCVSL, and PTL, as depicted
in Figure 9.1. Defining and accommodating the four types of delays at the switch level has
proven an innovative move that enabled us to treat CMOS gates implemented in a variety
of logic styles. The proposed macromodel for a logic gate is expressed in terms of the size
and the topology of the gate itself, the size and the topology of the driving gate, the size
of the loading gate or gates, and the interconnect capacitances. At this level, a pull-up or
pull-down network along the critical path of a gate is represented by parameters n, m, and
q, as illustrated in Figure 9.1. In the delay and energy macromodels for a DCVSL gate, we
have included the effect of the race between the PMOS latch and the NMOS network during
output switching. Neglecting the effect of opposing currents may result in considerable
underestimation of the delay and energy. We have derived three macromodels for a general
PTL cell based on whether the critical path involves a gate drive, a source drive, or both,
as illustrated in the figure. At the module level, to estimate the delay of a path containing
mixed logic styles, we add the delays of the gates along the path using the developed delay
macromodels.
9.1.2 Delay Optimization 
Using the logic-level delay macromodels, we have derived closed-form formulas for the
optimization of CMOS logic styles. For each logic style, there is an expression for the optimal
sizing of the pull-up network and another one for the pull-down network. In their exact
forms, these two expressions are not independent and, hence, are solved by a few iterations.
However, we have also approximated these expressions into simpler formulas for quick
optimizations, as listed in Table 9.1. In these formulas, all parameters, except ν, are
physical and identifiable through the schematics. Using the optimal sizing formulas, we have
demonstrated that it is feasible to optimize a circuit involving mixed CMOS logic styles.













Figure 9.1: General schematics of a conventional (top), DCVSL (middle), and PTL (bottom)
gate.
Table 9.1: Optimal transistor sizing in CMOS logic styles for minimizing the delay over
one cycle. Notation: subscripts n (NMOS), p (PMOS), t (total NMOS+PMOS), D, G, S
(drives), and L (load); accents: ´ (rising transition) and ` (falling transition); λ = ν́p/ν̀n.
For the special case of inverter circuits with a uniform PMOS to NMOS width ratio r,
we have shown that the following optimization formulas are valid for minimizing the total
delay, rising delay, and falling delay, respectively.
Where W, W_D, and W_L are the widths of the NMOS transistor in a reference inverter, its
driving inverter, and its loading inverter, respectively. The technology dependent parameter
λ is the PMOS-to-NMOS driveability ratio. In addition, we have been able to derive the
following relation for minimizing the energy-delay product, which is a design criterion in
some VLSI circuits.
For typical values of λ such as 2.5, the above evaluates to 0.65. On the other hand, our
formulation confirms that the total delay is minimum when r = √λ, which is around 1.5.
The formulation also clearly shows that in designing a chain of buffers for driving a large
load, tapering the buffers by a constant factor is a necessary condition rather than an
arbitrary assumption.
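The r = √λ condition can be checked numerically with a first-order model. The sketch below
is an illustration under stated assumptions, not the formulation of this thesis: an inverter
with NMOS width 1 and PMOS width r drives an identical inverter, the load is proportional
to 1 + r, the falling delay is proportional to the load, and the rising delay is the load
scaled by λ/r because a PMOS device is assumed to need λ times the width of an NMOS device
for the same drive. The function names are hypothetical.

```python
import math


def total_delay(r, lam):
    """Rising plus falling delay of an inverter driving a copy of itself.

    Assumed first-order model: load ~ (1 + r); pull-down strength ~ 1;
    pull-up strength ~ r / lam.
    """
    load = 1.0 + r
    falling = load
    rising = lam * load / r
    return rising + falling


def best_ratio(lam, lo=0.1, hi=10.0, steps=100_000):
    """Grid search for the width ratio r that minimizes the total delay."""
    candidates = (lo + (hi - lo) * k / steps for k in range(steps + 1))
    return min(candidates, key=lambda r: total_delay(r, lam))


if __name__ == "__main__":
    lam = 2.5
    print(best_ratio(lam), math.sqrt(lam))
```

In this model the total delay is 1 + λ + r + λ/r, whose derivative with respect to r vanishes
at r = √λ, so the grid search and the closed form agree to within the grid spacing.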
9.1.3 Circuits and Applications 
A considerable part of this work has been devoted to comparing different CMOS
implementations of logic gates. We have developed a fair method for this purpose. Based on this
method we have proposed that two rules should be observed. The first rule is that before
evaluating different implementations of a gate, each of them should be optimized for the
particular environment. The second rule is that the performance of the implementations should
be compared in light of their cost in terms of energy dissipation. Our optimal transistor
sizing formulas facilitate fulfilling the first rule. To comply with the second rule, we have
introduced the idea of using energy-delay and energy-frequency graphs. Our methodology
has been exemplified by applying it to the implementations of the C-element and XOR gate.
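The second rule can be sketched as a comparison of points on an energy-delay graph. The
Python fragment below is only an illustration: the implementation names and the numbers are
placeholders and are not results from this thesis, and the helper names are hypothetical.
Each point is assumed to describe an implementation that has already been optimized for its
environment, as the first rule requires.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    name: str
    delay: float   # optimized delay, arbitrary units
    energy: float  # energy per cycle, arbitrary units


def energy_delay_product(p: Point) -> float:
    """Single-number figure of merit: lower is better."""
    return p.delay * p.energy


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point beats in both delay and energy."""
    front = []
    for p in points:
        dominated = any(
            q.delay <= p.delay
            and q.energy <= p.energy
            and (q.delay < p.delay or q.energy < p.energy)
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


if __name__ == "__main__":
    # Placeholder values for illustration only.
    candidates = [
        Point("impl_a", 1.0, 4.0),
        Point("impl_b", 2.0, 2.0),
        Point("impl_c", 2.5, 3.0),
    ]
    for p in pareto_front(candidates):
        print(p.name, energy_delay_product(p))
```

A point that is both slower and more energy-hungry than another is dropped, while points
that trade delay against energy remain for the designer to choose between.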
We have studied the performance and energy consumption of eight CMOS implementations
of the C-element. Four out of the eight implementations belong to the single-rail family
of logic styles, and all have been used in practical circuits. The other four are based on a
differential logic style introduced in this thesis under the acronym DIL. DIL resembles DCVSL
in structure and operation but, unlike DCVSL, has a static memory. This property of DIL
makes it suitable for implementing primitives like the C-element and the TOGGLE. We have
compared the performance and energy dissipation of the C-element implementations in two
typical environments. The first environment evaluates a C-element operating in isolation,
while the second environment evaluates a group of C-elements that are mutually dependent.
In both environments, the C-elements were optimized for their best performance. Results
show that among the single-rail implementations, the symmetric C-element by Van Berkel
offers a better balance between performance and energy in both test environments. The
reason is that, compared to the other single-rail implementations, the symmetric one has
the least overhead for maintaining the state of the output. This is realized by, first, having
a topology that does not resist output switching and, second, having fewer keeper transistors.
The symmetric implementation, however, is outperformed in the first test environment by
two modified versions of the DIL implementation, namely DILP and DILN. In the second test
environment, the symmetric C-element is still preferred, because the differential
implementations may fail in structures with feedback loops due to the problem of divergence
of complementary outputs. These studies demonstrated that minimizing the number of
transistors dedicated to latching (keepers) and avoiding topologies that resist output
switching may result in significant energy savings.
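For readers less familiar with the primitive being compared, the following Python sketch
gives a purely behavioural model of a two-input C-element: the output follows the inputs
when they agree and holds its previous value when they differ. It says nothing about the
transistor-level implementations discussed above, and the class name is hypothetical.

```python
class CElement:
    """Behavioural model of a two-input C-element (state-holding primitive)."""

    def __init__(self, initial: int = 0) -> None:
        self.output = initial

    def update(self, a: int, b: int) -> int:
        # When both inputs agree, the output copies them;
        # otherwise the stored value is kept (the "keeper" role).
        if a == b:
            self.output = a
        return self.output


if __name__ == "__main__":
    c = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, c.update(a, b))
```

The hold behaviour in the disagreeing case is what the keeper transistors of a circuit
implementation provide, which is why their overhead matters in the comparison.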
We have also compared the conventional, DCVSL, and CPL implementations of the XOR 
gate. Since the performance and optimal transistor sizing of a gate are both functions of the
supply voltage, we have added another dimension to our comparison technique for the case of 
the XOR gate. We have evaluated the optimized performance and the corresponding energy 
of the implementations for a range of power supply voltages. The results of the comparison 
are in favour of the CPL implementation followed by the conventional implementation of the 
XOR gate. 
9.2 Directions for Future Research 
Considering the scope of this work, it may be extended in a number of directions. This 
section outlines some relevant potential future work. 
We have paved the way for developing a CAD tool for logic simulation and optimization
of digital CMOS circuits. This is the most obvious extension of our work. The tool would
support circuits including mixed CMOS logic styles. We are not aware of any such tool
currently available. The results of our work indicate that the CAD tool would be a fast
and accurate one. Resorting to a CAD tool is inevitable for delay estimation and optimization
at the system level. System level optimization requires special considerations, because
optimizing one path of the system may increase the delay in another one.
The applications of this work may be extended to cover additional asynchronous and
synchronous primitives. The most important elements on the priority list include latches and
flipflops [28,106]. This opens the door for delay estimation and optimization of sequential
synchronous and asynchronous circuits. For example, it would then be possible to optimize
a complete micropipeline structure, which consists of a control circuit and a data path.
Another interesting application for this work is the area of dynamic CMOS circuits. 
The delay models developed in this thesis may be used to optimize mixed logic-style digital
CMOS circuits under multi-constraints on the delay, energy, area, and power supply voltage.
This requires an investigation into finding or developing efficient non-linear optimization
algorithms for this purpose. The literature seems helpful in this regard [25,45,68,79].
Another interesting work would be to relate the parameters of our MOSFET current
model to the basic technology factors such as the carrier mobilities and doping densities.
An investigation into progressive transistor sizing of cascaded MOS transistors also seems
interesting [8,88].
9.3 Publications That Arose from the Thesis 
We have already published some of the early results of this work presented in chapters 2
through 5. The major results presented in chapters 6, 7, and 8, however, remain to be 
prepared for publication. 
• M. Shams, M. Elmasry, "A Formulation for Quick Evaluation and Optimization of
Digital CMOS Circuits," to appear in IEEE International Symposium on Circuits and
Systems, ISCAS-99, June 1999.
• M. Shams, J. Ebergen, M. Elmasry, "Modeling and Comparing CMOS Implementations
of the C-Element," IEEE Transactions on VLSI Systems, Special Issue on
Low-Power Electronics and Design, pp 563-567, December 1998.
• M. Shams, J. Ebergen, M. Elmasry, "Asynchronous Circuits," in Encyclopedia of
Electrical and Electronics Engineering, Editor: J. Webster, John Wiley, pp 716-725, March
1999.
• M. Shams, J. Ebergen, M. Elmasry, "Optimizing CMOS Implementations of the
C-Element," in IEEE International Conference on Computer Design, ICCD-97, pp 700-705,
October 1997.
• M. Shams, J. Ebergen, M. Elmasry, "Comparing CMOS Implementations of an
Asynchronous Circuits Primitive: the C-Element," in IEEE International Symposium on
Low-Power Electronics and Design, pp 93-96, August 1996.
Bibliography 
[1] A. J. Al-Khalili, Y. Zhu, and D. Al-Khalili, "A module generator for optimized CMOS
buffers," IEEE Transactions on Computer-Aided Design, vol. 9, pp. 1028-1046, Oct.
1990.
[2] K. v. Berkel, "Beware the isochronic fork," Integration, the VLSI journal, vol. 13,
pp. 103-128, June 1992.
[3] K. v. Berkel, Handshake Circuits: an Asynchronous Architecture for VLSI Programming,
vol. 5 of International Series on Parallel Computation. Cambridge University
Press, 1993.
[4] K. v. Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and F. Schalij, "A
fully-asynchronous low-power error corrector for the DCC player," IEEE Journal of
Solid-State Circuits, vol. 29, pp. 1429-1439, Dec. 1994.
[5] K. v. Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and F. Schalij, "A
fully-asynchronous low-power error corrector for the DCC player," in International Solid
State Circuits Conference, pp. 88-89, Feb. 1994.
[6] K. v. Berkel and M. Rem, "VLSI programming of asynchronous circuits for low power,"
in Asynchronous Digital Circuit Design (G. Birtwistle and A. Davis, eds.), Workshops
in Computing, pp. 152-210, Springer-Verlag, 1995.
[7] L. Bisdounis, S. Nikolaidis, and O. Koufopavlou, "Analytical transient response and
propagation delay evaluation of the CMOS inverter for short-channel devices," IEEE
Journal of Solid-State Circuits, vol. 33, pp. 302-306, Feb. 1998.
[8] S. S. Bizzan, G. A. Jullien, and W. C. Miller, "Analytical approach to sizing nFET
chains," Electronics Letters, vol. 28, pp. 1334-1335, July 1992.
[9] E. Brunvand and R. F. Sproull, "Translating concurrent programs into delay-insensitive
circuits," in Proc. International Conf. Computer-Aided Design (ICCAD), pp. 262-265,
IEEE Computer Society Press, Nov. 1989.
[10] J. A. Brzozowski and C.-J. H. Seger, Asynchronous Circuits. Springer-Verlag, 1995.
[11] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital
design," IEEE Journal of Solid-State Circuits, vol. 27, pp. 473-484, Apr. 1992.
[12] T. J. Chaney and C. E. Molnar, "Anomalous behavior of synchronizer and arbiter
circuits," IEEE Transactions on Computers, vol. C-22, pp. 421-422, Apr. 1973.
[13] K. Chen and C. Hu, "Performance and Vdd scaling in deep submicron CMOS," IEEE
Journal of Solid-State Circuits, vol. 33, pp. 1586-1589, Oct. 1998.
[14] K. Chen, C. Hu, P. Fang, M. R. Lin, and D. L. Wollesen, "Predicting CMOS speed with
gate oxide and voltage scaling and interconnect loading effects," IEEE Transactions
on Electron Devices, vol. 44, pp. 1951-1957, Nov. 1997.
[15] K. Choi, K. Lee, and J.-W. Kang, "A self-timed divider using RSD number system," in
Proc. International Conf. Computer Design (ICCD), IEEE Computer Society Press,
Oct. 1994.
[16] K. Chu and D. Pulfrey, "Design procedures for differential cascode voltage switch
circuits," IEEE Journal of Solid-State Circuits, vol. 21, pp. 1082-1087, Dec. 1986.
[17] K. Chu and D. Pulfrey, "A comparison of CMOS circuit techniques: Differential
cascode voltage switch logic versus conventional logic," IEEE Journal of Solid-State
Circuits, vol. 22, pp. 528-532, Aug. 1987.
[18] T.-A. Chu, Synthesis of Self-Timed VLSI Circuits from Graph-Theoretic Specifications.
PhD thesis, MIT Laboratory for Computer Science, June 1987.
[19] M. A. Cirit, "Transistor sizing in CMOS circuits," in Proc. ACM/IEEE Design
Automation Conference, pp. 121-124, ACM, 1987.
[20] W. A. Clark, "Macromodular computer systems," in AFIPS Conference Proceedings:
1967 Spring Joint Computer Conference, vol. 30, (Atlantic City, NJ), pp. 335-336,
Academic Press, 1967.
[21] W. A. Clark and C. E. Molnar, "Macromodular computer systems," in Computers in
Biomedical Research (R. W. Stacy and B. D. Waxman, eds.), vol. IV, ch. 3, pp. 45-85,
Academic Press, 1974.
[22] B. Coates, A. Davis, and K. Stevens, "The Post Office experience: Designing a large
asynchronous chip," Integration, the VLSI journal, vol. 15, pp. 341-366, Oct. 1993.
[23] P. Cocchini, G. Piccinini, and M. Zamboni, "A comprehensive submicrometer MOST
delay model and its application to CMOS buffers," IEEE Journal of Solid-State Circuits,
vol. 32, pp. 1254-1262, Aug. 1997.
[24] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, "On the
Lambert W function," Tech. Rep. CS-93-03, University of Waterloo, Mar. 1993.
[25] O. Coudert, "Gate sizing for constrained delay/power/area optimization," IEEE
Transactions on VLSI Systems, vol. 5, pp. 465-472, Dec. 1997.
[26] A. Davis, "Synthesizing asynchronous circuits: Practice and experience," in
Asynchronous Digital Circuit Design (G. Birtwistle and A. Davis, eds.), Workshops in
Computing, pp. 104-150, Springer-Verlag, 1995.
[27] A. Davis and S. M. Nowick, "Asynchronous circuit design: Motivation, background,
and methods," in Asynchronous Digital Circuit Design (G. Birtwistle and A. Davis,
eds.), Workshops in Computing, pp. 1-49, Springer-Verlag, 1995.
[28] P. Day and J. V. Woods, "Investigation into micropipeline latch design styles," IEEE
Transactions on VLSI Systems, vol. 3, pp. 264-272, June 1995.
[29] D. L. Dill, Trace Theory for Automatic Hierarchical Verification of Speed-Independent
Circuits. ACM Distinguished Dissertations, MIT Press, 1989.
[30] D. W. Dobberpuhl et al., "A 200-MHz 64-b dual-issue CMOS microprocessor," IEEE
Journal of Solid-State Circuits, vol. 27, pp. 1555-1568, Nov. 1992.
[31] S. Dutta, S. S. Mahant, and S. L. Lusky, "A comprehensive delay model for CMOS
inverters," IEEE Journal of Solid-State Circuits, vol. 30, pp. 864-871, Aug. 1995.
[32] J. Ebergen and S. Gingras, "A verifier for network decompositions of command-based
specifications," in Proc. Hawaii International Conf. System Sciences, vol. 1, IEEE
Computer Society Press, Jan. 1993.
[33] J. C. Ebergen, Translating Programs into Delay-Insensitive Circuits, vol. 56 of CWI
Tract. Centre for Mathematics and Computer Science, 1989.
[34] J. C. Ebergen, J. Segers, and I. Benko, "Parallel program and asynchronous circuit
design," in Asynchronous Digital Circuit Design (G. Birtwistle and A. Davis, eds.),
Workshops in Computing, pp. 51-103, Springer-Verlag, 1995.
[35] W. C. Elmore, "The transient response of damped linear networks with particular
regard to wideband amplifiers," Journal of Applied Physics, vol. 19, pp. 55-63, Jan.
1948.
[36] J. P. Fishburn and A. E. Dunlop, "Tilos: A posynomial programming approach to
transistor sizing," in Proc. International Conf. Computer-Aided Design (ICCAD),
pp. 326-328, IEEE Computer Society Press, Nov. 1985.
[37] S. Furber, "Computing without clocks: Micropipelining the ARM processor," in
Asynchronous Digital Circuit Design (G. Birtwistle and A. Davis, eds.), Workshops in
Computing, pp. 211-262, Springer-Verlag, 1995.
[38] J. Garside, "The Asynchronous
[39] S. Hauck, "Asynchronous design methodologies: An overview," Proceedings of the
IEEE, vol. 83, Jan. 1995.
[40] N. Hedenstierna and K. O. Jeppson, "CMOS circuit speed and buffer optimization,"
IEEE Transactions on Computer-Aided Design, vol. CAD-6, pp. 270-281, Mar. 1987.
[41] K. S. Hedlund, "Aesop: A tool for automated transistor sizing," in Proc. ACM/IEEE
Design Automation Conference, pp. 114-120, ACM, 1987.
[42] L. G. Heller, W. R. Griffin, J. W. Davis, and N. G. Thoma, "Cascode voltage switching
logic: A differential CMOS logic family," in International Solid State Circuits
Conference, pp. 16-17, 1984.
[43] C. A. R. Hoare, Communicating Sequential Processes. Prentice-Hall, 1985.
[44] L. A. Hollaar, "Direct implementation of asynchronous control units," IEEE
Transactions on Computers, vol. C-31, pp. 1133-1141, Dec. 1982.
[45] B. Hoppe, G. Neuendorf, D. Schmitt-Landsiedel, and W. Specks, "Optimization of
high-speed CMOS logic circuits with analytical models for signal delay, chip area, and
dynamic power dissipation," IEEE Transactions on Computer-Aided Design, vol. 9,
pp. 236-247, Mar. 1990.
[46] M. Horowitz, "Timing models for MOS pass networks," in Proc. International
Symposium on Circuits and Systems, pp. 198-201, IEEE, 1983.
[47] C. Hu, "Device and technology impact on low power electronics," in Low-Power Design
Methodologies (J. M. Rabaey and M. Pedram, eds.), pp. 317-322, Kluwer Academic
Publishers, 1996.
[48] D. A. Huffman, "The synthesis of sequential switching circuits," IRE Transactions on
Electronic Computers, vol. 257, no. 3 & 4, 1954.
[49] K. O. Jeppson, "Modeling the influence of the transistor gain ratio and the
input-to-output coupling capacitance on the CMOS inverter delay," IEEE Journal of Solid-State
Circuits, vol. 29, pp. 646-654, June 1994.
[50] M. B. Josephs and J. T. Udding, "An overview of DI algebra," in Proc. Hawaii
International Conf. System Sciences, vol. 1, IEEE Computer Society Press, Jan. 1993.
[51] S.-M. Kang and Y. Leblebici, CMOS Digital Integrated Circuits: Analysis and Design.
McGraw-Hill, 1996.
[52] U. Ko, P. T. Balsara, and W. Lee, "Low-power design techniques for high-performance
CMOS adders," IEEE Transactions on VLSI Systems, vol. 3, pp. 327-332, June 1995.
[53] N. C. Li, G. L. Haviland, and A. A. Tuszynski, "CMOS tapered buffer," IEEE Journal
of Solid-State Circuits, vol. 25, pp. 1005-1008, Aug. 1990.
[54] L. R. Marino, "General theory of metastable operation," IEEE Transactions on
Computers, vol. C-30, pp. 107-115, Feb. 1981.
[55] A. J. Martin, "Formal program transformations for VLSI circuit synthesis," in Formal
Development of Programs and Proofs (E. W. Dijkstra, ed.), UT Year of Programming
Series, pp. 59-80, Addison-Wesley, 1989.
[56] A. J. Martin, "Programming in VLSI: From communicating processes to
delay-insensitive circuits," in Developments in Concurrency and Communication (C. A. R.
Hoare, ed.), UT Year of Programming Series, pp. 1-64, Addison-Wesley, 1990.
[57] A. J. Martin, S. M. Burns, T. K. Lee, D. Borkovic, and P. J. Hazewindus, "The design
of an asynchronous microprocessor," in Advanced Research in VLSI: Proceedings of
the Decennial Caltech Conference on VLSI (C. L. Seitz, ed.), pp. 351-373, MIT Press,
1989.
[58] E. J. McCluskey, "Fundamental mode and pulse mode sequential circuits," in Proc. of
IFIP Congress 62, pp. 725-730, North-Holland, 1963.
[59] R. Mehrotra, M. Pedram, and X. Wu, "Comparison between nMOS pass transistor logic
styles vs. CMOS complementary cells," in Proc. International Conf. Computer Design
(ICCD), pp. 130-135, IEEE Computer Society Press, Oct. 1997.
[60] T. H.-Y. Meng, Asynchronous Design for Digital Signal Processing Architectures. PhD
thesis, UC Berkeley, 1988.
[61] T. H.-Y. Meng, R. W. Brodersen, and D. G. Messerschmitt, "Automatic synthesis of
asynchronous circuits from high-level specifications," IEEE Transactions on
Computer-Aided Design, vol. 8, pp. 1185-1205, Nov. 1989.
[62] G. Merkel, J. Borel, and N. Z. Cupcea, "An accurate large signal MOS transistor model
for use in computer-aided design," IEEE Transactions on Electron Devices, 1972.
[63] R. E. Miller, Sequential Circuits and Machines, vol. 2 of Switching Theory. John Wiley
& Sons, 1965.
[64] C. E. Molnar, I. W. Jones, B. Coates, and J. Lexau, "A FIFO ring oscillator
performance experiment," in Proc. International Symposium on Advanced Research in
Asynchronous Circuits and Systems, IEEE Computer Society Press, Apr. 1997.
[65] D. E. Muller and W. S. Bartky, "A theory of asynchronous circuits," in Proceedings
of an International Symposium on the Theory of Switching, pp. 204-243, Harvard
University Press, Apr. 1959.
[66] A. Nabavi-Lishi and N. C. Rumin, "Inverter models of CMOS gates for supply
current and delay evaluation," IEEE Transactions on Computer-Aided Design, vol. 13,
pp. 1271-1279, Oct. 1994.
[67] S. M. Nowick and D. L. Dill, "Automatic synthesis of locally-clocked asynchronous
state machines," in Proc. International Conf. Computer-Aided Design (ICCAD),
pp. 318-321, IEEE Computer Society Press, Nov. 1991.
[68] P. Pant, V. K. De, and A. Chatterjee, "Simultaneous power supply, threshold voltage,
and transistor size optimization for low-power operation of CMOS circuits," IEEE
Transactions on VLSI Systems, vol. 6, pp. 538-545, Dec. 1998.
[69] J. R. Pasternak and C. A. T. Salama, "Design of submicron CMOS differential
pass-transistor logic circuits," IEEE Journal of Solid-State Circuits, vol. 26, pp. 1249-1258,
Sept. 1991.
[70] A. Peeters, "The 'Asynchronous' Bibliography (BIBTEX) database file async.bib."
ftp://ftp.win.tue.nl/pub/tex/async.bib.Z. Corresponding e-mail address:
async-bib@win.tue.nl.
[71] J. L. Peterson, "Petri nets," Computing Surveys, vol. 9, pp. 223-252, Sept. 1977.
[72] J. M. Rabaey, Digital Integrated Circuits. Prentice-Hall, 1996.
[73] M. Renaudin and B. E. Hassan, "The design of fast asynchronous adder structures
and their implementation using DCVS logic," in Proc. International Symposium on
Circuits and Systems, 1994.
[74] J. Rubinstein, P. Penfield, and M. Horowitz, "Signal delay in RC tree networks," IEEE
Transactions on Computer-Aided Design, vol. 2, pp. 202-211, July 1983.
[75] A. E. Ruehli, P. K. Wolff, and G. Goertzel, "Analytical power/timing optimization
technique for digital system," in Proc. ACM/IEEE Design Automation Conference,
pp. 142-146, ACM, 1977.
[76] T. Sakurai and A. R. Newton, "Alpha-power law MOSFET model and its applications
to CMOS inverter delay and other formulas," IEEE Journal of Solid-State Circuits,
vol. 25, pp. 584-594, Apr. 1990.
[77] T. Sakurai and A. R. Newton, "Delay analysis of series-connected MOSFET circuits,"
IEEE Journal of Solid-State Circuits, vol. 26, pp. 122-131, Feb. 1991.
[78] T. Sakurai and A. R. Newton, "A simple MOSFET model for circuit analysis," IEEE
Transactions on Electron Devices, vol. 38, pp. 887-894, Apr. 1991.
[79] S. S. Sapatnekar, V. B. Rao, P. M. Vaidya, and S.-M. Kang, "An exact solution to
the transistor sizing problem for CMOS circuits using convex optimization," IEEE
Transactions on Computer-Aided Design, vol. 12, pp. 1621-1634, Nov. 1993.
[80] C. L. Seitz, "System timing," in Introduction to VLSI Systems (C. A. Mead and L. A.
Conway, eds.), ch. 7, Addison-Wesley, 1980.
[81] M. Shams, J. Ebergen, and M. Elmasry, "A comparison of CMOS implementations
of an asynchronous circuits primitive: the C-element," in International Symposium on
Low Power Electronics and Design, pp. 93-96, Aug. 1996.
[82] M. Shams, J. C. Ebergen, and M. I. Elmasry, "Optimizing CMOS implementations of
C-element," in Proc. International Conf. Computer Design (ICCD), pp. 700-705, Oct.
1997.
[83] -M. Shams. J. C. Ebergen, and M. 1. Ehasry, "Modeling and comparing CMOS imple- 
mentations of the C-element," IEEE Dansactions on VLSI Systems, vol. 6, pp. 563- 
567. Dec. 1998. 
[84] M. Shams, J.  C. Ebergen, and M. 1. Ehasry, 'Asynchronous circuits," in Encyclopedia 
of Electrical and Electronics Engineering (J. Webster, ed.) ,  John Wiley & Sons, 1999. 
[851 M. Shams and M. 1. Ehasry, "A formulation for qui& evaluation and optimization of 
digit al CMOS circuits," in PTOC. International Symposium on Cimrits and Systems, 
p. (To appear), May 1999. 
[86] B. J. Sheu, D. L. Scharfetter, P.-K. Ko, and M.-C. Jeng, 'Bsim: Berkeley short- 
channel igfet model for mos transistors," IEEE Journal of Solid-State Circuits, vol. SC- 
22, pp. 558-563, Aug. 1987. 
[87] W. Shockley, &A unipolar field &ect transistor," Proc. IRE, vol. 40, pp. 1365-1376, 
Nov. 1952. 
[88] M. Sho ji, "FET scaling in domino CMOS gates," IEEE Journal of Solid-Stute Circuits, 
vol. 20, pp. 1067-1071, Oct. 1985. 
[89] M. Shoji, CMOS Digital Circuit Technology. Prentice-Hall, 1988. 
[90] LM. Shyu, A. Sangiovanni-Vincentelli, J .  P. Fishburn, and A. E. Dunlop, 
"Optimization-based transistor sizing," IEEE Journal of Solid-State Circuits, vol. 23, 
pp. 400409, Apr. 1988. 
[91] R. F. Sproull and 1. E. Sutherland, Asynchronow Sys t em.  Palo Alto: Sutherland, 
Sprod  and Associates, 1986. Vol. 1: Introduction, Vol. II: Logical effort and asyn- 
chronous modules, Vol. III: Case studies. 
[92] 1. E. Sutherland, "Micropipelines," Communications of the A C ' ,  vol. 32, pp. 720-738, 
[93] 1. E. Sutherland and R. F. Sproull, "Logid dort: Desiging for speed on the back of 
an envelope," in Advanced Research in VLSl, pp. 1-16, Sept. 1991. 
[94] M. Suzuki, N. Ohkubo, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and
Y. Nakagome, "A 1.5-ns 32-b CMOS ALU in double pass-transistor logic," IEEE Journal of
Solid-State Circuits, vol. 28, pp. 1145-1151, Nov. 1993.
[95] K.-Y. Toh, P.-K. Ko, and R. G. Meyer, "An engineering model for short-channel MOS
devices," IEEE Journal of Solid-State Circuits, vol. 23, pp. 950-958, Aug. 1988.
[96] J. T. Udding, Classification and Composition of Delay-Insensitive Circuits. PhD thesis,
Dept. of Math. and C.S., Eindhoven Univ. of Technology, 1984.
[97] S. H. Unger, Asynchronous Sequential Switching Circuits. New York:
Wiley-Interscience, John Wiley & Sons, Inc., 1969.
[98] S. R. Vemuru and A. R. Thorbjornsen, "Variable-taper CMOS buffer," IEEE Journal
of Solid-State Circuits, vol. 26, pp. 1265-1269, Sept. 1991.
[99] T. Verhoeff, A Theory of Delay-Insensitive Systems. PhD thesis, Dept. of Math. and 
C.S., Eindhoven Univ. of Technology, May 1994. 
[100] Z. Wang, G. A. Jullien, W. C. Miller, J. Wang, and S. S. Bizzan, "Fast adders using
enhanced multiple-output domino logic," IEEE Journal of Solid-State Circuits, vol. 32,
pp. 206-214, Feb. 1997.
[101] S. Weber, B. Bloom, and G. Brown, "Compiling Joy to silicon," in Proceedings
of Brown/MIT Conference on Advanced Research in VLSI and Parallel Systems
(T. Knight and J. Savage, eds.), pp. 79-98, MIT Press, Mar. 1992.
[102] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Addison-Wesley,
1994. 
[103] T. E. Williams and M. A. Horowitz, "A zero-overhead self-timed 160ns 54b CMOS
divider," IEEE Journal of Solid-State Circuits, vol. 26, pp. 1651-1661, Nov. 1991.
[104] L. T. Wurtz, "An efficient procedure for domino CMOS logic," IEEE Journal of
Solid-State Circuits, vol. 28, pp. 979-982, Sept. 1993.
[105] K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, "A
3.8-ns CMOS 16x16-b multiplier using complementary pass-transistor logic," IEEE
Journal of Solid-State Circuits, vol. 25, pp. 388-395, Apr. 1990.
[106] J. Yuan and C. Svensson, "New single-clock CMOS latches and flipflops with improved
speed and power savings," IEEE Journal of Solid-State Circuits, vol. 32, pp. 62-69,
Jan. 1997.
[107] K. Y. Yun, P. A. Beerel, and J. Arceo, "High-performance asynchronous pipeline
circuits," in Proc. International Symposium on Advanced Research in Asynchronous
Circuits and Systems, IEEE Computer Society Press, Mar. 1996.
[108] K. Y. Yun and D. L. Dill, "Automatic synthesis of 3D asynchronous state machines,"
in Proc. International Conf. Computer-Aided Design (ICCAD), pp. 576-580, IEEE
Computer Society Press, Nov. 1992.
[109] R. Zimmermann and W. Fichtner, "Low-power logic styles: CMOS versus
pass-transistor logic," IEEE Journal of Solid-State Circuits, vol. 32, pp. 1079-1090, July
1997.
