CMOS Circuit Speed and Power Optimization Using Simplified RC Delay Model by Lakkakula, Sunil Kumar
CMOS CIRCUIT SPEED AND POWER OPTIMIZATION 
USING SIMPLIFIED RC DELAY MODEL 
 
 
   By 
      SUNIL KUMAR LAKKAKULA 
   Bachelor of Technology 
Electrical & Electronics Engineering  
   Acharya Nagarjuna University 
   Guntur, Andhra Pradesh, India 
   2007 
 
   Master of Science 
Electrical Engineering  
   Oklahoma State University 
   Stillwater, Oklahoma 
   2009 
 
 
   Submitted to the Faculty of the 
   Graduate College of the 
   Oklahoma State University 
   in partial fulfillment of 
   the requirements for 
   the Degree of 
   DOCTOR OF PHILOSOPHY 
   May, 2015
 
CMOS CIRCUIT SPEED AND POWER OPTIMIZATION 
USING SIMPLIFIED RC DELAY MODEL 
 
 
   Dissertation Approved: 
 
   Dr. Louis G. Johnson 
  Dissertation Adviser 
   Dr. Gary Yen 
 
   Dr. Rama Ramakumar 
 
   Dr. Blayne Mayfield 
Acknowledgements reflect the views of the author and are not endorsed by committee 
members or Oklahoma State University. 
ACKNOWLEDGEMENTS 
 
First, I would like to thank my adviser Dr. Louis G. Johnson for his wonderful support, 
guidance and encouragement throughout my PhD program. I am honored to be his PhD 
student. I also thank him for providing me with financial assistance through Graduate 
Research Assistantship during most of my PhD program. 
I would like to thank my committee members Dr. Gary Yen, Dr. Rama Ramakumar and 
Dr. Blayne Mayfield for serving on my committee and providing their valuable 
comments and feedback on my work. 
I am also thankful to the Department of Campus Life and the Service-Learning Volunteer 
Center at Oklahoma State University for allowing me to work as a Graduate Teaching 
Assistant, during which I learned great community and volunteer skills. 
I am also very grateful and thankful to the following people who played a major role in 
my life during my stay in the United States. Firstly my friends Rajashekhar Yaramasu, 
Siddarth Kota, Brian Joseph, Dr. Satyanarayana Achanta, Satish Bhattiprolu, Satyashil 
Shinde, Dr. Anand Govindarajan, Dr. Upasana Manimegalai Sridhar, Dr. Kumar 
Singarapu, Jeet Turakhia, Suresh Kumar Jayaraman, Vijayalakshmi Sethuraman, 
Samyukta Koteeswaran, Ram Kumar Isakki for their support and cooperation. Also, Tim 
Huff, Regina Henry, Ruthie Loffi, Kent Sampson, Joyce Montgomery, Marie Basler from 
the Department of Campus Life for their valuable support and guidance for me in 
learning leadership and service skills. My labmates Dr. Julius Marpaung, Wira Mulia for 
their assistance in learning new technical skills. My American friends Alice Sharrock, 
Russ Sharrock, Janina Graves, Mason Williams, Grice’s Family for their valuable 
friendship. My relatives in the United States for welcoming me and my wife into their 
homes with great love and affection. 
Finally, I thank my parents Nageswara Rao Lakkakula and Lakshmi Lakkakula for their 
constant support and trust in me, without which I would not have come this far or been able to 
finish my PhD. I thank my brother Anil Kumar Lakkakula for his blessings from heaven. 
I also thank my friend and sister Surekha Bathula for her support. Last but not least, I 
thank my wife Mounika Tiruamalsetty for being with me and supporting me in finishing 
my PhD successfully.
 
Name: SUNIL KUMAR LAKKAKULA   
 
Date of Degree: May, 2015 
  
Title of Study: CMOS CIRCUIT SPEED AND POWER OPTIMIZATION USING 
SIMPLIFIED RC DELAY MODEL 
 
Major Field: ELECTRICAL ENGINEERING 
 
Abstract: A simplified RC delay model, expressed explicitly in terms of transistor 
widths, is presented to analyze circuits easily and to optimize the transistor widths 
quickly for delay and power without tedious layout or schematic simulation. A novel 
heuristic gradient descent method is used to solve the optimization problem effectively. 
Popular parallel prefix adders such as the Brent-Kung, Sklansky and Kogge-Stone adders 
are modeled using the proposed simplified RC delay model, and a power-delay performance 
comparison is made after optimization to demonstrate the capability of the model. The 
optimization results suggest that the Brent-Kung adder is the most efficient in terms of 
both delay and power, having the lowest power-delay product, followed by the Sklansky 
adder and then the Kogge-Stone adder. 
 
 
TABLE OF CONTENTS 
 
Chapter 
 
I. INTRODUCTION 
 1.1 Background 
       1.1.1 MOSFET Operation and Characteristics 
       1.1.2 Circuit Delay and Power 
 
II. REVIEW OF LITERATURE 
 2.1 Review of device models 
 2.2 Review of the Method of Logical Effort 
 2.3 Review of the gradient descent based optimization 
 2.4 Review of optimization techniques based on transistor sizing 
 2.5 Review of parallel prefix adders and their performance 
 
III. METHODOLOGY 
 3.1 Simplified RC delay Model 
 3.2 Power Model 
 3.3 Optimization Methodology 
 3.4 Model Validation 
       3.4.1 Lightly loaded case 
       3.4.2 Heavily loaded case 
 3.5 Convexity of the objective function 
 
IV. PERFORMANCE EVALUATION & OPTIMIZATION RESULTS 
 4.1 Parallel prefix adders 
 4.2 Optimization results 
 
V.  CONCLUSION 
 
REFERENCES 
 
APPENDICES 
 Appendix A 
 Appendix B 
 Appendix C 
 
LIST OF TABLES 
 
   Table 2.1 Logical effort for inputs of static CMOS gates, assuming γ=2 
   Table 2.2 Parasitic delay estimates of different logic gates 
   Table 3.1 Parasitic delay expressions for 2-input NAND gate 
   Table 3.2 Capacitance values for various processes 
   Table 3.3 CPU time comparison 
LIST OF FIGURES 
 
   Figure 1.1: Cross section of nMOS transistor 
   Figure 1.2: I-V characteristics of nMOS transistor 
   Figure 1.3: Plot defining circuit delay 
   Figure 1.4: Example RC Circuit for Elmore Delay 
   Figure 2.1: Switched-Resistor Transistor Model 
   Figure 2.2: Piecewise Linear Model 
   Figure 2.3: FO4 – Inverter driving four identical inverters 
   Figure 2.4: Logic network consisting of three two-input NAND gates 
   Figure 3.1: 2-input NAND gate 
   Figure 3.2: Channel Propagation Capacitance 
   Figure 3.3: Diffusion Capacitance with diffusion contact 
   Figure 3.4: Diffusion Capacitance with no diffusion contact 
   Figure 3.5: Diffusion Capacitance with shared diffusion contact 
   Figure 3.6: 2x4 Decoder circuit with enable signal 
   Figure 3.7: 3-input NAND gate and inverter 
   Figure 3.8: OF Surface when minimizing delay 
   Figure 3.9: Four inverter chain with a light load and no Cwire 
   Figure 3.10: Model vs. SPICE power-delay plot for lightly loaded case 
   Figure 3.11: Four inverter chain with a heavy load and large Cwire 
   Figure 3.12: Model vs. SPICE power-delay plot for heavily loaded case 
   Figure 4.1: 16-bit Brent-Kung adder 
   Figure 4.2: 16-bit Sklansky adder 
   Figure 4.3: 16-bit Kogge-Stone adder 
   Figure 4.4: Power-delay comparison plot for 32-bit adder case 
   Figure 4.5: Power-delay comparison plot for 64-bit adder case 
   Figure 4.6: Power-delay comparison plot for 128-bit adder case 
 
 
CHAPTER I 
 
 
INTRODUCTION 
A major concern in VLSI circuit design is the delay and power dissipation of the circuit. 
Circuit optimization based on transistor sizing is a very useful way to optimize a circuit 
for speed and power dissipation and so reach the required performance. A model that is 
expressed explicitly in terms of transistor widths is therefore needed to analyze how changing 
the transistor widths affects circuit speed and power, and to allow a flexible trade-off. 
Using RC transistor models to model and simulate a CMOS circuit is popular because of their 
simplicity and small computational burden compared to the complex and computationally expensive 
BSIM models that are used in SPICE [1] simulators. This dissertation primarily presents the 
implementation of a simplified RC delay model that quickly estimates CMOS circuit delay, and 
its application to circuit speed and power optimization by correctly sizing the transistor 
widths. In many cases the circuit to be optimized has several possible critical paths. 
There is no simple solution for the optimum transistor sizes when multiple paths are optimized 
simultaneously, so numerical techniques are needed to solve the optimization problem accurately. 
It is much easier to use numerical techniques with our simplified RC delay equations 
than with SPICE for the following reasons: the delays are explicit functions of the transistor 
widths, making it obvious what happens when any width is changed; the width dependence of the 
delays is smooth, so that simple numerical methods can quickly find the optimum transistor 
widths; and if the optimum widths found do not meet the specification, then there is probably no 
feasible solution and the specification must be changed.
 
Since addition forms the basis for many processing operations and adder circuits are of great interest 
to digital system designers, popular prefix adders such as the Brent-Kung [2], Sklansky [3] and 
Kogge-Stone [4] adders are modeled in this dissertation using the proposed simplified RC delay model, 
and a power-delay performance comparison is made to show the ability of the method to quickly 
optimize complex circuits like adders. 
1.1 Background 
This section gives some background on how MOS transistors work and how the circuit delay and 
power are calculated. 
1.1.1 MOSFET Operation and Characteristics 
A Complementary Metal Oxide Semiconductor (CMOS) circuit consists of two types of transistors, 
nMOS and pMOS. An nMOS transistor is an n-channel MOSFET whose majority charge carriers are 
electrons. A pMOS transistor is a p-channel MOSFET whose majority charge carriers are holes. 
The design and operation of an nMOS transistor is discussed below.  
In an nMOS transistor, the source and drain are n-type regions and the body is a p-type region. With 
sufficient positive voltage at the gate, holes from the p-type body are driven away from the gate, 
forming an n-type channel between the p-type body and the oxide. This channel extends between the 
source and the drain, and when a voltage is applied between the drain and the source, current is 
conducted through it. Figure 1.1 shows the cross section of an nMOS transistor without and with 
channel formation. 
The operation of a MOSFET is categorized into three regions or modes: cut-off, ohmic (or linear) 
and saturation. The operating regions of an nMOS transistor are discussed below. 
Cut-off region (VGS < Vth): In this region the gate to source bias VGS is less than the threshold voltage 
Vth and there is no significant conduction between drain and source. Hence the transistor can be 
considered turned off in this region and the drain current is taken to be zero. 
 
 
 
 
 
 
 
 
Figure 1.1: Cross section of nMOS transistor (a) without channel (b) with channel 
Ohmic region (VGS > Vth and VDS < VGS-Vth): In this region the gate to source bias is greater than the 
threshold voltage and the drain to source bias is less than the gate overdrive voltage VGS-Vth. The 
transistor is turned on and current conduction takes place between drain and source. In this region, 
as the drain to source voltage increases, the drain current (ID) also increases. 
Saturation region (VGS > Vth and VDS > VGS-Vth): In this region the gate to source bias is greater than 
the threshold voltage and the drain to source bias is greater than the gate overdrive voltage VGS-Vth. 
The transistor is turned on and current conduction takes place between drain and source. In this region, 
as the drain to source bias voltage increases, there is almost no increase in the value of the 
drain current.  
The current vs voltage (I-V) characteristics of an nMOS transistor are shown in Figure 1.2. 
 
 
 
 
 
 
Figure 1.2: I -V characteristics of nMOS transistor 
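
The three operating regions described above can be illustrated with a short numerical sketch. It is not part of the original work: it assumes the Shockley square-law expressions reviewed later in Section 2.1 (Eq. 2.1), and the function names and the default values of the threshold voltage and drivability factor K are hypothetical, chosen only for illustration.

```python
def nmos_region(vgs, vds, vth):
    """Return the operating region of an nMOS transistor."""
    if vgs <= vth:
        return "cutoff"
    if vds < vgs - vth:
        return "ohmic"
    return "saturation"


def nmos_drain_current(vgs, vds, vth=0.5, k=1e-3):
    """Square-law drain current in amperes.

    vth (volts) and k (A/V^2) are assumed illustrative values,
    not parameters of any particular process.
    """
    region = nmos_region(vgs, vds, vth)
    if region == "cutoff":
        return 0.0
    vov = vgs - vth  # gate overdrive voltage VGS - Vth
    if region == "ohmic":
        return k * (vov * vds - 0.5 * vds ** 2)
    return 0.5 * k * vov ** 2
```

Sweeping `vds` at a fixed `vgs` with this sketch gives a current that rises in the ohmic region and stays flat once `vds` exceeds `vgs - vth`, which is the qualitative shape described for Figure 1.2.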
 
In pMOS, the source and drain are p-type regions and body is n-type. The working of a pMOS 
transistor is similar to nMOS but with a negative voltage applied at its gate terminal. 
 
1.1.2 Circuit Delay and Power 
Delay: The usual definition of the circuit delay is the time difference between the half-Vdd point of 
the circuit’s input voltage waveform and the half-Vdd point of the circuit’s output voltage waveform. 
 
 
 
 
 
 
 
 
 
 
 
Figure 1.3: Plot defining circuit delay 
From Figure 1.3, delay = t – t1, where t1 and t are the times at which the input and the output 
waveforms, respectively, cross Vdd/2. To determine the delay of a circuit accurately, the exact 
waveforms of the input and output are required, and determining them is computationally expensive. 
Hence Elmore [5] defined the delay at node e, tde, as the centroid of the output impulse response 
e’(t), which can be found independently of the exact waveforms. It is given by the expression shown 
in Eq. (1.1) and is also called the first moment of the impulse response. Elmore delay is a good 
approximation for delay in RC tree networks. 
\[ \text{Delay} = t_{de} = \int_{0}^{\infty} t\, e'(t)\, dt = \sum_{k} R_{ek} C_{k} \frac{\Delta V_{k}}{\Delta V_{e}} \tag{1.1} \]
 
where Rek is the resistance of the portion of the path to the Vdd/Gnd node that is shared by node e 
and node k, and Ck is the capacitance of the k-th node. ∆Vk is the voltage difference between the 
Vdd/Gnd node and the initial voltage at node k: ∆Vk = Vk(0) for falling voltages and 
∆Vk = Vdd - Vk(0) for rising voltages. Similarly, ∆Ve is the voltage difference between the Vdd/Gnd 
node and the initial voltage at node e: ∆Ve = Ve(0) for falling voltages and ∆Ve = Vdd – Ve(0) for 
rising voltages. When all of the node voltages start at the same voltage, the voltage ratio ∆Vk/∆Ve 
equals one and drops out. For example, consider the RC network shown in Figure 1.4 and assume that 
all of the node voltages start at the same voltage. 
 
 
 
 
Figure 1.4: Example RC Circuit for Elmore Delay 
The Elmore delay at node 3 is given as td3 = R1C1 + (R1+R2) C2 + (R1+R2+R3) C3 
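
The expression for td3 can be checked with a minimal sketch. It assumes the example network is a simple RC ladder (each resistor in series, each capacitor to ground), which is what the expression above implies, and that all nodes start at the same voltage so the voltage ratio in Eq. (1.1) is one. The function name and component values are hypothetical.

```python
def elmore_delay_ladder(resistances, capacitances, node):
    """Elmore delay at `node` (1-based) of an RC ladder driven from one end.

    resistances[i] is the resistor between node i and node i+1 (node 0 is
    the source); capacitances[i] is the capacitor from node i+1 to ground.
    """
    delay = 0.0
    for k in range(1, len(capacitances) + 1):
        # Resistance shared by the source-to-`node` path and the source-to-k path.
        shared = sum(resistances[:min(k, node)])
        delay += shared * capacitances[k - 1]
    return delay


# Assumed example values: R1..R3 in ohms, C1..C3 in farads.
R = [1e3, 2e3, 3e3]
C = [1e-12, 2e-12, 3e-12]
td3 = elmore_delay_ladder(R, C, node=3)
```

For `node=3` the loop accumulates R1·C1 + (R1+R2)·C2 + (R1+R2+R3)·C3, matching the expression for td3 term by term.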
 
Power: The usual definition of power dissipation is the amount of heat energy dissipated in unit time. 
The average power dissipation Pavg is defined as shown in Eq. (1.2) 
\[ P_{avg} = \frac{1}{T} \int_{0}^{T} I_{Vdd}(t)\, V_{dd}\, dt \tag{1.2} \]
where IVdd(t) is the power supply current, Vdd is the supply voltage and T is the time interval over 
which the power is averaged. 
 The components of power dissipation in CMOS circuits consist of off-state leakage power, 
dynamic power due to charging and discharging of node capacitances, short-circuit power, switching 
power due to parasitic capacitances and glitch power due to unequal arrival of signals. Estimating the 
power dissipation with accurate models is difficult and computationally expensive. 
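
When the supply current is available as uniformly spaced samples, for example from a simulator, Eq. (1.2) can be approximated numerically. The sketch below is only an illustration under that assumption. It uses the trapezoidal rule, and the function name and sample values are hypothetical.

```python
def average_power(currents, vdd, dt):
    """Approximate (1/T) * integral of I(t) * Vdd dt with the trapezoidal rule.

    currents: supply-current samples in amperes, uniformly spaced by dt seconds.
    vdd: constant supply voltage in volts.
    """
    if len(currents) < 2:
        raise ValueError("need at least two current samples")
    total_time = dt * (len(currents) - 1)
    charge = sum(0.5 * (a + b) * dt for a, b in zip(currents, currents[1:]))
    return vdd * charge / total_time


# Assumed example: a triangular current pulse sampled every 1 ns, Vdd = 1.8 V.
p_avg = average_power([0.0, 2e-3, 0.0], vdd=1.8, dt=1e-9)
```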
 
CHAPTER II 
 
 
REVIEW OF LITERATURE 
2.1 Review of device models 
Prior to the late 1960s, performance estimation techniques were computationally simple because the 
number of transistors on a single chip was small, and they used physical models based upon 
approximate modeling of the physical phenomena within a transistor. As the number of 
transistors on a single chip increased, the complexity of the circuit also increased, making the 
physical models inadequate for quantitative analysis because they are computationally expensive. A 
circuit simulator like SPICE [1] handles the complex nonlinear physical models of the transistor 
and solves the whole circuit as one large matrix to obtain the node outputs. Usually no more than a 
few thousand transistors can be simulated in a reasonable amount of computation time. Initial 
attempts to create transistor models for large circuit simulation used empirical models to 
estimate the delay and power. These models are based entirely upon curve fitting, using whatever 
functions and parameters best fit the measured data. The use of empirical models limits the 
type of circuits that can be modeled. Their disadvantages are a lack of error control in the 
resulting models and difficulty in relating the performance values back to the 
circuit elements. To solve this problem, a combination of physical and empirical models is used 
to analyze the circuit’s performance. In the Shockley square law model [6] the drain current ID is 
expressed as shown in Eq. 2.1. 
\[ I_D = \begin{cases} 0, & V_{GS} \le V_{TH} \ \text{(cutoff region)} \\ K\{(V_{GS}-V_{TH})V_{DS} - 0.5V_{DS}^{2}\}, & V_{GS} > V_{TH},\ V_{DS} < V_{DSsat} \ \text{(ohmic region)} \\ 0.5K(V_{GS}-V_{TH})^{2}, & V_{GS} > V_{TH},\ V_{DS} \ge V_{DSsat} \ \text{(saturation region)} \end{cases} \tag{2.1} \]
where VDSsat = VGS - VTH is the drain saturation voltage and VTH is the threshold voltage. K is a 
drivability factor given by K = μ(εox/tox)(W/L), where μ denotes the effective mobility, εox the 
dielectric constant of the gate oxide, tox the gate oxide thickness, W the channel width and L the 
channel length. In the alpha power law model [7], which is an extension of Shockley’s square law 
model, the drain current in the saturation region is modeled as shown in Eq. 2.2.  
\[ I_D \propto (V_{GS}-V_{TH})^{\alpha}, \quad V_{GS} > V_{TH},\ V_{DS} \ge V_{DSsat} \ \text{(saturation region)} \tag{2.2} \]
It includes velocity saturation effects, which the original Shockley square law model does not 
take into account, in order to predict circuit behavior in the sub-micrometer regime and in deep 
submicron processes. The value of α is calculated directly from the measured data and usually 
lies between 1 and 2. For deep submicron processes the value of α is close to 1. With this 
nonlinear device model, circuit simulation is computationally expensive and there is no closed 
form expression for the circuit delay, so the waveforms for the input and output are 
required. 
To further simplify the analysis problem and to improve the simulation speed, all non-linear 
elements were approximated by appropriate linear elements by performing small signal analysis. 
In the Switched-Resistor model used in the IRSIM circuit simulator [8], the transistor is modeled 
as a voltage controlled switch connected to a series resistance as shown in Figure 2.1. 
 
 
 
 
 
Figure 2.1: Switched-Resistor Transistor Model 
Hence a MOS circuit can be considered as an RC network and the delay through a circuit path 
can be computed by using the simple expression given by Elmore in Eq. (1.1). By modeling the 
transistor as a switched resistor, the delay and power analysis of a circuit can be done with much 
less computational burden, and circuits with several hundreds of thousands of transistors can be 
simulated in a reasonable amount of time. Although this model is less accurate, because it 
approximates a non-linear transistor as a linear resistor, it is very helpful for analyzing circuits 
with less computational burden and for quickly estimating the delay without needing to determine 
the input and output waveforms.  
Reference [9] discusses a piecewise linear device model which is a modified switched resistor 
model with better accuracy. In this model the transistor is approximated as an open switch when 
the transistor is in the cutoff region, as a linear resistor when operating in the ohmic region and as 
a current source that is a function of input voltage when operating in the saturation region. 
 
 
 
Figure 2.2: Piecewise Linear Model 
The drain current ID using this model is given by Eq. 2.3. 
\[ I_D = \begin{cases} 0, & V_{GS} \le V_{TH} \ \text{(cutoff region)} \\ a \cdot G \cdot V_{DS}, & V_{GS} > V_{TH},\ V_{DS} < V_{DSsat} \ \text{(ohmic region)} \\ G_m \cdot (V_{GS}-V_{TH}), & V_{GS} > V_{TH},\ V_{DS} \ge V_{DSsat} \ \text{(saturation region)} \end{cases} \tag{2.3} \]
This model also includes the input slope and the effects of short circuit current and velocity 
saturation. It also uses a linearized BSIM3 capacitance model. Because it includes all the major 
physical phenomena that are important in the submicron and deep submicron regimes, it is more 
accurate than the models discussed above, but it is still computationally expensive. 
It also cannot produce a closed form expression for the circuit delay and hence 
requires the input and output waveforms. 
Hence using RC device models will be more useful for fast circuit analysis and efficient circuit 
optimization as the circuit delay can be defined as a closed form expression without needing the 
exact input and output waveforms. 
2.2 Review of the Method of Logical Effort 
Estimating the circuit delay early in the design process is very useful and can save a lot of time in 
designing a circuit that meets the required design specifications. As discussed in Section 2.1, 
complex device models make delay estimation very difficult because there is no closed 
form solution for estimating the delay easily and quickly; the circuit layout must be simulated 
to obtain the input and output waveforms before the delay can be estimated. 
With this approach, designing a circuit that meets the required design specifications takes a very 
long time. The method of logical effort [10] is an easy way to estimate 
delay in CMOS circuits without a tedious layout simulation. It also allows early 
modifications to the circuit design, to achieve the greatest speed or to meet any delay constraints, 
by comparing delay estimates of different logic structures. 
 
The delay incurred by a logic gate is comprised of two components, a fixed part called the 
parasitic delay p and a part that is proportional to the load on the gate’s output, called the effort 
delay or stage effort f. The total delay d measured in units of τ, is the sum of the effort and 
parasitic delays and is given by Eq. 2.4 
 d = f + p              (2.4) 
where τ is the delay of an inverter driving an identical inverter with no parasitics. The effort delay 
is given as f=gh where g is the logical effort and h is the electrical effort given by Cout/Cin . Hence 
the delay through a single logic gate is  
 d=gh+p              (2.5) 
Logical effort is defined so that an inverter has a logical effort of 1. The logical effort of any other 
logic gate tells how much worse it is at producing output current than an inverter, given that 
each of its inputs may present only the same input capacitance as the inverter. Table 2.1 
below shows the logical effort values for different logic gates with different numbers of inputs. 
                     Number of inputs 
Gate Type      1     2     3     4     5      n 
Inverter       1 
NAND                 4/3   5/3   6/3   7/3    (n+2)/3 
NOR                  5/3   7/3   9/3   11/3   (2n+1)/3 
Multiplexer          2     2     2     2      2 
Table 2.1: Logical effort for inputs of static CMOS gates, assuming γ = 2 [6] 
where γ is the ratio of an inverter’s pull-up transistor width to pull-down transistor width. 
The parasitic delay of a logic gate is fixed and is expressed as a multiple of the parasitic delay of 
an inverter, denoted pinv, which is typically 1.0 delay unit. Table 2.2 shows crude estimates of 
parasitic delay for a few logic gates. Though these estimates are not very accurate, they are 
convenient for hand analysis. 
Gate Type            Parasitic Delay 
Inverter             pinv 
n-input NAND         n·pinv 
n-input NOR          n·pinv 
n-way multiplexer    2n·pinv 
Table 2.2: Parasitic delay estimates of different logic gates 
Let’s look at how the delay of a logic gate can be estimated using the method of logical effort. 
For example, consider a fanout-of-4 (FO4) inverter as shown in Figure 2.3. 
 
 
Figure 2.3: FO4 – An inverter driving four identical inverters 
Because each inverter is identical, Cout = 4Cin, so h = 4. Since the logical effort g of an inverter 
is 1, the effort delay is f = gh = 1 x 4 = 4. The parasitic delay for an inverter is p = pinv = 1. 
Hence the delay of the FO4 inverter is d = f + p = 4 + 1 = 5.0 delay units. 
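
The single-gate calculation above can be reproduced with a minimal sketch. It is not part of the original work: the function names are hypothetical, the logical effort values are those of Table 2.1 (γ = 2), and the parasitic delays are the crude estimates of Table 2.2. Exact fractions are used so that results such as 4/3 are not rounded.

```python
from fractions import Fraction


def logical_effort(gate, n_inputs=1):
    """Logical effort g per input for gamma = 2 (values of Table 2.1)."""
    if gate == "inverter":
        return Fraction(1)
    if gate == "nand":
        return Fraction(n_inputs + 2, 3)
    if gate == "nor":
        return Fraction(2 * n_inputs + 1, 3)
    if gate == "mux":
        return Fraction(2)
    raise ValueError(f"unknown gate type: {gate}")


def parasitic_delay(gate, n_inputs=1, p_inv=1):
    """Crude parasitic delay estimates of Table 2.2, in units of tau."""
    if gate == "inverter":
        return Fraction(p_inv)
    if gate in ("nand", "nor"):
        return Fraction(n_inputs) * Fraction(p_inv)
    if gate == "mux":
        return 2 * Fraction(n_inputs) * Fraction(p_inv)
    raise ValueError(f"unknown gate type: {gate}")


def gate_delay(gate, c_out, c_in, n_inputs=1, p_inv=1):
    """Delay d = g*h + p (Eq. 2.5), with electrical effort h = c_out / c_in."""
    h = Fraction(c_out) / Fraction(c_in)
    return logical_effort(gate, n_inputs) * h + parasitic_delay(gate, n_inputs, p_inv)


# FO4 inverter: the load is four times the input capacitance.
fo4 = gate_delay("inverter", c_out=4, c_in=1)
```

With these assumptions `fo4` evaluates to 5, matching the hand calculation. The same function applied to a 2-input NAND with the same load gives (4/3)·4 + 2 = 22/3 delay units.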
The method of logical effort can be extended to multistage logic networks, and it can also help 
reveal the best number of stages in a multistage network to obtain the least overall delay. The 
path logical effort G is given by G = Пgi and the path electrical effort H is given by H = Cout/Cin, 
where Cin and Cout refer to the input and output capacitances of the path. When estimating 
the path delay, the branching effort b at the output of a logic gate needs to be taken into 
consideration to account for the fanout within a network. Hence the path branching effort B is 
given as B = Пbi. Now the path effort F can be defined as F = GBH and the path parasitic 
delay P is defined as P = ∑pi. With fi = gihi denoting the effort delay of stage i, the path delay is 
given as 
\[ D = \sum_{i} f_i + P \tag{2.6} \]
In an N-stage logic network, the path delay is least when each stage in the 
path bears the same stage effort, i.e. fopt = gihi = F^(1/N). Hence the minimum delay achievable 
along an N-stage logic network path is 
\[ D_{opt} = N F^{1/N} + P \tag{2.7} \]
Let’s look at how the path delay in a multistage network can be estimated using the method of 
logical effort. For example, consider the circuit shown in Figure 2.4. 
 
 
Figure 2.4: Logic network consisting of three two-input NAND gates 
Following the above discussion on multistage logic networks, the path delay from A to B 
for the circuit shown in Figure 2.4 is computed as follows: 
G = Пgi = (4/3)^3; H = 8C/C = 8; B = Пbi = 1; F = GBH = 18.96; P = ∑pi = 3(2pinv) = 6 
Hence the least path delay is Dopt = N F^(1/N) + P = 3(18.96)^(1/3) + 6 = 14.0 delay units. Now let’s 
look at how the transistors should be sized along the path to achieve this least delay. Here z and 
y denote the input capacitances of the third and second gates. Since fopt = 18.96^(1/3) = 8/3, the 
last stage must satisfy g3h3 = 8/3, so h3 = (8/3)/g3. With h3 = 8C/z, this gives 
z = 8C(g3)/(8/3) = 8C(4/3)(3/8) = 4C. Similarly, y = z(g2)/(8/3) = 4C(4/3)(3/8) = 2C. 
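
The path calculation above can be sketched in code as follows. This is an illustrative helper under stated assumptions, not the method developed in this dissertation: the function name and argument layout are hypothetical, capacitances are expressed in units of C, and the sizing step works backwards from the load using Cin,i = gi·bi·Cout,i / fopt.

```python
from math import prod


def optimize_path(logical_efforts, parasitics, c_in, c_load, branching=None):
    """Return (d_opt, f_opt, input_caps) for a path, per Eq. 2.7.

    logical_efforts, parasitics, branching: one entry per stage, first to last.
    input_caps[i] is the input capacitance that stage i needs so that every
    stage bears the same effort f_opt.
    """
    n = len(logical_efforts)
    if branching is None:
        branching = [1.0] * n
    path_effort = prod(logical_efforts) * prod(branching) * (c_load / c_in)
    f_opt = path_effort ** (1.0 / n)
    d_opt = n * f_opt + sum(parasitics)

    caps = []
    c_out = c_load
    # Walk from the last stage back to the first, sizing each gate input.
    for g, b in zip(reversed(logical_efforts), reversed(branching)):
        c_out = g * b * c_out / f_opt
        caps.append(c_out)
    caps.reverse()
    return d_opt, f_opt, caps


# Three 2-input NAND gates, input capacitance C = 1, load 8C.
d_opt, f_opt, caps = optimize_path([4 / 3] * 3, [2, 2, 2], c_in=1.0, c_load=8.0)
```

For this example the sketch returns a delay of about 14.0 delay units and input capacitances of about 1, 2 and 4 (in units of C) for the first, second and third gates, consistent with y = 2C and z = 4C. The first value reproducing the given input capacitance C is a useful sanity check on the sizing.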
Though the method of logical effort is simple and effective in quickly estimating the circuit delay, 
it allows only a fixed ratio γ of p transistor width to n transistor width when determining the 
transistor sizes, while a more efficient design is possible when the p and n transistor widths can be 
chosen independently. Also, when using the method of logical effort, there is no easy 
way to include the interconnect wire capacitance in estimating the circuit delay, which makes 
it a less accurate method. 
In this dissertation we present a simplified RC delay model that can easily take into account the 
interconnect wire capacitance in estimating the path delays and also allows the p and n 
transistor widths to be chosen independently so that a more efficient circuit can be designed. 
2.3 Review of the gradient descent based optimization 
Optimization is useful in achieving the minimum or maximum of an objective function (OF) by 
determining optimum values for the decision variables. In most cases it is required to minimize 
the function. When the OF is continuous with respect to its decision variables, gradient based 
methods are very useful because the gradient clearly indicates which direction to move in to find a 
better OF value faster. In the gradient descent method, steps are taken proportional to the negative 
of the gradient of the function at the current point to obtain a minimum OF value. The gradient 
descent method is useful in finding a local minimum of the function but does not guarantee that the 
solution found is a global optimum, except when the OF is a convex function. When the OF is not 
continuous, the gradient of the function cannot be determined at all values of the decision 
variables, and direct search methods such as particle swarm or leapfrogging can be used to 
optimize the function without evaluating the gradient of the function. Since the 
objective function in our analysis is continuous, the gradient descent method can be used for 
optimization. 
Let’s look into the details of how the gradient descent method is implemented in general. Let F(x) 
be a multivariable objective function that is differentiable in the neighborhood of any given point 
14 
 
xn in the decision variable space then the new point xn+1 satisfying F(xn) >= F(xn+1) in the 
direction of negative gradient of F(x) for a small value of δn is defined as 
 xn+1 = xn – δn (F’(xn))             (2.8) 
It takes several iterations to obtain the best possible minimum value of F(x), and the value of δn
can change in each iteration. The best value of δn can be chosen via a line search, which is again
an iterative process using the gradient descent method, to obtain the optimum value δn = δn* that
minimizes F along the negative gradient direction at xn. Eq. 2.8 is then defined as
$$x_{n+1} = x_n - \delta_n^{*} F'(x_n) \tag{2.9}$$
Eq. 2.9 is the Cauchy method [11], popularly known as the steepest descent method. Convergence of
the algorithm is decided using a small value ε, usually on the order of 10^-5, such that
|F'(xn)| <= ε and/or |Δx| = |xn+1 – xn| <= ε. Most often a fixed small value δn = δ is assumed in
each iteration to reduce the convergence time. Though this method is simple and moves toward the
best local minimum of the OF, it performs poorly in terms of convergence and may not converge in a
reasonable amount of time when the OF is bumpy and has v-shaped minima, which is the case when
minimizing the circuit delay in our analysis.
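The iteration in Eq. 2.8 and Eq. 2.9 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name, its parameters and the backtracking (step-halving) search used here in place of an exact line search are all assumptions made for the example.

```python
import numpy as np

def steepest_descent(f, grad, x0, delta0=1.0, eps=1e-5, max_iter=10000):
    """Minimize f by stepping along the negative gradient (Eq. 2.8).

    The step size is picked per iteration by halving until a sufficient
    decrease is observed; this is an assumed, inexpensive stand-in for the
    exact line search of Eq. 2.9.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:              # gradient-based stopping test
            break
        delta = delta0
        # Halve delta until f decreases by at least half the linear prediction.
        while f(x - delta * g) > f(x) - 0.5 * delta * g.dot(g) and delta > 1e-16:
            delta *= 0.5
        x_new = x - delta * g
        converged = np.linalg.norm(x_new - x) <= eps   # step-based stopping test
        x = x_new
        if converged:
            break
    return x
```

On a smooth convex function this converges reliably; on an OF with v-shaped minima the gradient does not vanish at the minimum, so only the step-based test can terminate the loop, which is the difficulty described above.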
Reference [12] presented formulae to calculate the step-size δn in each iteration instead of doing a 
line search that helped in much faster convergence of the conventional steepest descent algorithm.  
$$\delta_n = \frac{s_{n-1}^{T}\, y_{n-1}}{\|y_{n-1}\|_2^2} \tag{2.10}$$
or
$$\delta_n = \frac{\|s_{n-1}\|_2^2}{s_{n-1}^{T}\, y_{n-1}} \tag{2.11}$$
where s_{n-1} = x_n – x_{n-1} and y_{n-1} = F'(x_n) – F'(x_{n-1}). Though this method is better
than the Cauchy method, it still takes a significant amount of time to converge when the OF surface
is bumpy and has v-shaped minima. Hence there is a need for a faster way of implementing the
gradient descent algorithm that converges quickly on such surfaces while giving satisfactory
results compared to the methods discussed above.
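The two step-size formulas in Eq. 2.10 and Eq. 2.11 have the form of the Barzilai-Borwein step sizes. A minimal sketch of how they replace the line search is given below; the function name, the fixed first step `delta0` and the fallback used when the curvature estimate is not positive are assumptions made for this illustration only.

```python
import numpy as np

def two_point_step_descent(grad, x0, variant=1, delta0=1e-3, eps=1e-5, max_iter=1000):
    """Gradient descent with the step size of Eq. 2.10 (variant=1) or Eq. 2.11 (variant=2)."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - delta0 * g_prev          # first step: assumed small fixed step
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            break
        s = x - x_prev                    # s_{n-1} = x_n - x_{n-1}
        y = g - g_prev                    # y_{n-1} = F'(x_n) - F'(x_{n-1})
        sy = s.dot(y)
        if sy <= 0.0:
            delta = delta0                # assumed fallback when s^T y is not positive
        elif variant == 1:
            delta = sy / y.dot(y)         # Eq. 2.10
        else:
            delta = s.dot(s) / sy         # Eq. 2.11
        x_prev, g_prev = x, g
        x = x - delta * g
    return x
```

No function values are needed, only gradients at two consecutive points, which is why each iteration is cheaper than one with a line search.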
2.4 Review of optimization techniques based on transistor sizing 
The literature describes several techniques for improving circuit performance, one of which is
circuit optimization. Circuit optimization can in turn be done using various methods such as
transistor sizing, transistor reordering and transistor tapering. However, transistor sizing is
considered to be the simplest and most effective method in CMOS circuit optimization. Previous
research shows many attempts to optimize circuits for speed and power using transistor sizing.
Some have used RC transistor models to approximate the circuit delay/power, while others have used
non-RC models. Some of them are reviewed below.
One of the early tools for circuit speed optimization based on transistor sizing is discussed in
[13] and is called SLOP (Switch Level Optimization). It uses an RC tree approximation of the
circuit to estimate the circuit path delay. The total delay of path i is given by the sum of the
delay contributions from each transistor in that path.
$$d_i = t_1 + t_2 + \dots + t_j + \dots + t_n \tag{2.12}$$
To minimize di w.r.t. the transistor width Wj, evaluate $\partial d_i / \partial W_j = 0$ and solve
for Wj*, which is the minimum delay contribution width of transistor j in path i. The critical path
delay dc = (di)max determines the speed of the circuit. The optimization is done in three main
stages:
a) Initialization – record the initial delays di and find the critical delay from the first
simulation.
b) Global test – establish the delay, gain and device matrices. The delay matrix is established by
changing one transistor width by one step at a time, simulating, and recording the delays of
each path. The gain matrix is established by calculating the partial derivatives of di with
respect to the transistor width Wj. The device matrix is established by recording the device
numbers with non-zero gain.
c) Critical path test – test only the devices in the critical delay column.
Optimization by this method is complicated and time consuming because it requires changing only
one transistor width at a time and determining the delays of each path. Proper path balancing may
not be achieved with this method, which can lead to glitching.
[14] also uses an RC model to estimate the delay, but the transistor sizing is done to minimize the
area subject to a delay constraint. The optimization problem is treated as a convex programming
problem. Convex optimization is efficient because any local minimum is also the global minimum. In
[14] the minimization is done on one path at a time; in most cases, however, multiple path delays
need to be minimized at the same time, at which point the optimization problem becomes non-convex
and no longer has a simple solution.
[15] takes a different approach in using the RC model for circuit area/power optimization. Large
transistors are first placed and routed to meet the delay specification of the circuit, and the
interconnect wire lengths are then extracted. The circuit is optimized for area/power subject to
the delay constraint; as a result the transistor sizes decrease, creating spaces in the layout, and
the interconnect wires can be re-routed through these spaces to reduce the overall wire length,
which in turn reduces the wire capacitance. In this approach the optimization is achieved mainly by
minimizing the interconnect wire length.
[16] also uses an RC model, and the transistor sizing is done to optimize the power-delay product
rather than only the delay or the power of a CMOS circuit. The delay calculations are done using
Elmore's RC delay model. A loading coefficient α = CL/Cd is used as the decision variable for the
power-delay product (PDP) objective function, where CL is the load capacitance at a node and Cd is
the circuit internal capacitance at that node. The PDP is expressed in terms of α, and the optimum
αopt is determined by solving $\partial PDP / \partial \alpha = 0$. As the model is not explicitly
expressed in terms of transistor widths, it is difficult to analyze the effects of changing the
transistor widths on the power-delay product.
[17] also uses the RC Elmore delay model. Both the transistor widths and lengths are used as
decision variables, and the problem is solved as a multi-objective optimization. The transistor
widths and lengths are modified to match all the circuit path delays with the critical path delay,
achieving path balancing to avoid glitching while minimizing circuit power consumption. To achieve
optimum results the transistor lengths have to be increased, which increases both the gate
capacitances and the area. To reduce this negative influence of the increased transistor lengths,
two alternatives were proposed: twin transistors and merged transistors. In order to equalize all
the path delays w.r.t. the critical path, every path requires individual optimization. In this
approach the delay is not actually minimized; instead, all circuit path delays are made equal to
the critical path delay to achieve path balancing.
[18], [19], [20] and [21] use complex, non-linear models to minimize delay and power. Though these
models are more accurate than simple RC models, circuit optimization for speed and power using them
is difficult and not very efficient because of the complexity of the models.
[22] and [23] discuss other techniques such as transistor reordering and transistor tapering to
optimize the circuit for delay and power; however, we prefer transistor sizing as it is a simple
and effective way to optimize the circuit for speed and power.
2.5 Review of parallel prefix adders and their performance 
Binary addition is one of the most frequently used arithmetic operations in microprocessors. A
large variety of algorithms and implementations have been proposed for binary addition. When high
operation speed is of great importance, parallel-prefix adders such as the Brent-Kung [2],
Sklansky [3] and Kogge-Stone [4] adders are commonly used. Any decrease in the delay translates
directly into an increase in the adder throughput. The primary requirements for any adder are that
it should be fast and efficient in terms of power consumption and chip area. When the adder size is
small, these parallel prefix adders show similar performance in terms of delay and power, but as
the adder size gets large (N > 16 bits), the difference in their performance becomes significant
due to the difference in their implementation.
The Sklansky adder has a minimum-depth prefix network at the cost of increased fan-out at certain
nodes; the fan-out doubles at each successive level of the prefix network, so the maximum fan-out
grows with the adder size. The Kogge-Stone adder has optimal depth and low fan-out but produces a
massively complex circuit with a large number of interconnects when the adder size is big. The
Brent-Kung adder has the advantage of a minimal number of circuit nodes, which yields reduced area,
but this implementation requires the maximum depth, which increases the latency compared to the
other structures. The above analysis is mostly true when fixed transistor sizes or minimum sized
transistors are used. Hence it is useful to learn the performance of these adders for different
adder sizes with optimum transistor widths, to allow a better choice between them in the design of
high speed microprocessors. Since designing and optimizing the adders is a time-consuming process,
it would be very useful to use a simple device model such as the RC transistor model to estimate
the delay and power without having to draw the layout.
[24] presents the design and performance comparison of various high-speed adders using CMOS and
transmission gate technology. This work showed that the adders designed with transmission gates
have much lower power and delay than those with CMOS gates; the focus of this dissertation,
however, is design with CMOS gates. The analysis in [24] requires the adder layout/schematic to be
completed first. Moreover, there is no information on whether optimum transistor widths were used
in their adder comparison.
[25] presents the design and performance of parallel prefix adders that use alternating odd and
even cells, eliminating the unwanted buffers between the cells of two stages in the conventional
design. A similar approach will be used in this dissertation to eliminate the unwanted buffers in
the adder design. This approach also requires the adder layout/schematic to be completed for the
analysis and does not discuss the optimization of the adders for optimum delay or power.
[26] also presents a performance comparison of various high-speed adders using Xilinx ISE for
simulation and synthesis, without discussing the optimization of the adders for improved
performance.
[27] presents a comparison of adder performance with radix-4 and radix-2 for a 32-bit parallel
prefix adder. This work shows that the radix-4, Sparse-4 adder implementation has reduced delay
with a minor increase in power compared to the radix-2 implementation. This method also requires
the adder analysis using the layout/schematic and does not discuss the use of optimum transistor
widths in the design.
This dissertation presents the adder analysis without the tedious layout/schematic simulation by
using a simplified RC delay model, and also presents a performance comparison of different
high-speed adders with optimum transistor widths for different delay and power constraints.
 
 
CHAPTER III 
 
 
METHODOLOGY 
3.1 Simplified RC Delay Model 
The simplified RC delay model [28] consists of a parasitic delay tdP, which arises due to the
circuit's internal parasitic resistances and capacitances and is estimated using the Elmore delay
technique, and an effort delay tdF, which arises due to the capacitive load on the circuit's output
node. Hence the delay (td) of any circuit path is given by
 td = tdP + tdF              (3.1) 
 tdF = Rout Cload              (3.2) 
When there is no load on the output node, i.e. Cload = 0, then td = tdP; when there is a load on
the output node, td = tdP + tdF. The parasitic delay and the output resistance Rout are determined
by the topology of the circuit.
To illustrate how the simplified RC delay model quickly estimates the circuit or gate delay,
consider the 2-input NAND gate shown in Figure 3.1 as an example circuit. The transistor level
diagram is shown on the left and the corresponding stick diagram on the right in Figure 3.1. The
circuit has four transistors and two parasitic node capacitances, CY and C1. CY is the same as
Cout, the parasitic capacitance on the output node Y.
Figure 3.1: 2-input NAND gate a) Transistor diagram b) Stick diagram 
The parasitic delay and the output resistance of this circuit are determined for different delay 
paths based on Elmore delay calculations as shown in Table 3.1. 
 
Table 3.1: Parasitic delay expressions for 2-input NAND gate 
Here RpA, RnA, RpB and RnB are the channel resistances of the transistors. In the above table the
delay paths are determined assuming only one input changes at a time while the other inputs are not
switching. In the case of the 2-input NAND gate, when one input is switching the other input has to
be '1' for the delay of the path from the changing input to the output to be defined: if the
non-switching input is '0', the output does not change whether the switching input is falling or
rising, so there is no delay to determine in that case.
Estimation of channel resistance in terms of Unit R: 
The major goal of the simplified RC delay model is to see the effect of changing transistor widths
without having to do detailed layout and simulation. The channel resistance of an individual
transistor is inversely proportional to its channel width and proportional to its channel length,
and is given by
$$R_X = R_s \frac{L_X}{W_X} \tag{3.3}$$
where RX is the channel resistance of transistor X, Rs is the transistor sheet resistance (a
process constant), LX is the transistor channel length and WX is the transistor channel width. In
our design the transistor channel length is chosen to be the minimum feature size of the process,
Lmin. Hence the channel resistances of the nFET and pFET transistors are as given below, where the
subscripts 'n' and 'p' refer to the n-channel and p-channel respectively.
$$R_{nX} = R_{sn} \frac{L_{min}}{W_{nX}} \tag{3.4}$$
$$R_{pX} = R_{sp} \frac{L_{min}}{W_{pX}} \tag{3.5}$$
To simplify calculations of channel resistance and make them process independent, let us define
channel resistance as a multiple of R, the channel resistance of a minimum size nFET.
$$R \equiv R_{sn} \frac{L_{min}}{W_{min}} \tag{3.6}$$
$$R_{nX} = R_{sn} \frac{L_{min}}{W_{nX}} = \frac{W_{min}}{W_{nX}} R \tag{3.7}$$
$$R_{pX} = R_{sp} \frac{L_{min}}{W_{pX}} = \frac{W_{min}}{W_{pX}} 2R \tag{3.8}$$
Note that there is an extra factor of 2 in the pFET channel resistance to account for the
significant difference between the n-channel and p-channel sheet resistances. For 0.18 µm
technology, R is approximately 5 kΩ.
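Eq. 3.7 and Eq. 3.8 translate directly into code. The short sketch below returns channel resistances as multiples of the unit resistance R; the function names are chosen for this illustration, and the default Wmin of 0.36 µm is the value this chapter uses later for the 0.18 µm process.

```python
W_MIN_UM = 0.36      # minimum transistor width for the 0.18 um process (micrometers)
R_UNIT_OHM = 5e3     # approximate unit resistance R for the 0.18 um process (ohms)

def r_nfet(w_um, w_min_um=W_MIN_UM):
    """nFET channel resistance in units of R (Eq. 3.7)."""
    return w_min_um / w_um

def r_pfet(w_um, w_min_um=W_MIN_UM):
    """pFET channel resistance in units of R (Eq. 3.8); note the factor of 2."""
    return 2.0 * w_min_um / w_um
```

For example, a pFET has to be twice as wide as an nFET to present the same channel resistance: `r_pfet(0.72)` and `r_nfet(0.36)` both evaluate to one unit R, or about 5 kΩ after multiplying by `R_UNIT_OHM`.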
Estimation of parasitic capacitances in terms of Unit C: 
The transistor gate capacitance per unit width, Cin/W, stays relatively constant when all three
dimensions of the gate scale by the same factor, and is approximately 2 fF/µm. To simplify
calculations of parasitic capacitance and make them process independent, let us define the unit
capacitance C as
$$C \equiv \left(\frac{C_{in}}{W}\right) W_{min} \tag{3.9}$$
For TSMC 0.18 µm technology, C is approximately 0.89 fF.
Hence the capacitance at node i in terms of the unit capacitance C is given as
$$C_i \equiv \frac{C_i}{\left(\frac{C_{in}}{W}\right) W_{min}}\, C \tag{3.10}$$
The internal parasitic capacitance at a node is separated into the propagation channel capacitance
(Cprop) and the diffusion capacitance (Cdiff).
The propagation channel capacitance Cprop of an n-channel transistor, in terms of the transistor
channel width, is given as
 
Figure 3.2: Channel Propagation Capacitance 
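$$C_{prop} = \begin{cases} \left(\dfrac{C_{ngdo}}{P}\right) W_{nX}, & \text{off} \\[2ex] \left[\dfrac{1}{2}\left(\dfrac{C_g}{A}\right) L_{min} + \left(\dfrac{C_{ngdo}}{P}\right)\right] W_{nX}, & \text{on} \end{cases} \tag{3.11}$$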
Normalizing to the unit capacitance 
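$$C_{prop} = \begin{cases} \dfrac{\left(\dfrac{C_{ngdo}}{P}\right) W_{nX}}{\left(\dfrac{C_{in}}{W}\right) W_{min}}\, C, & \text{off} \\[4ex] \dfrac{\left[\dfrac{1}{2}\left(\dfrac{C_g}{A}\right) L_{min} + \left(\dfrac{C_{ngdo}}{P}\right)\right] W_{nX}}{\left(\dfrac{C_{in}}{W}\right) W_{min}}\, C, & \text{on} \end{cases} \tag{3.12}$$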
The numbers for various processes are shown in Table 3.2. 
 
Table 3.2: Capacitance values for various processes 
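From the above table, the average value of Cngdo/Cin for various processes is about 0.25. Hence
$$C_{prop} \approx \begin{cases} \dfrac{1}{4} \dfrac{W_{nX}}{W_{min}}\, C, & \text{off} \\[2ex] \dfrac{1}{2} \dfrac{W_{nX}}{W_{min}}\, C, & \text{on} \end{cases} \tag{3.13}$$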
The above expression makes the node capacitance depend on whether the transistors are ON or OFF,
which greatly complicates the analysis without significantly improving accuracy. To simplify the
model, Cprop is approximated as $C_{prop} = \frac{1}{2} \frac{W_{nX}}{W_{min}} C$ for the n-channel
and, similarly, $C_{prop} = \frac{1}{2} \frac{W_{pX}}{W_{min}} C$ for the p-channel.
The diffusion capacitance (Cdiff) depends on whether a) there is a diffusion contact, b) there is
no diffusion contact between two transistors, or c) there is a diffusion contact that is shared
between two transistors.
a) The diffusion capacitance of transistor A in terms of unit capacitance C when there is a 
diffusion contact for n-diffusion and p-diffusion is modeled as 
 
Figure 3.3: Diffusion Capacitance with diffusion contact 
$$C_{ndcX} = \frac{1}{2}\left(\frac{K\, W_{nA}}{W_{min}} + K_1\right) C \tag{3.14}$$
$$C_{pdcX} = \frac{1}{2}\left(\frac{K\, W_{pA}}{W_{min}} + K_1\right) C \tag{3.15}$$
where K and K1 are process constants.
b) The diffusion capacitance when there is no diffusion contact between two transistors A
and B is modeled for n-diffusion and p-diffusion as
 
Figure 3.4: Diffusion Capacitance with no diffusion contact 
$$C_{ndncXY} = \frac{1}{2}\left(\frac{K (W_{nA} + W_{nB})}{W_{min}} + K_1\right) C \tag{3.16}$$
$$C_{pdncXY} = \frac{1}{2}\left(\frac{K (W_{pA} + W_{pB})}{W_{min}} + K_1\right) C \tag{3.17}$$
c) The diffusion capacitance when there is a diffusion contact but shared between two 
transistors A and B for n-diffusion and p-diffusion is modeled as 
 
Figure 3.5: Diffusion Capacitance with shared diffusion contact 
$$C_{ndscXY} = \frac{1}{2}\left(\frac{K (W_{nA} + W_{nB})}{W_{min}} + K_1\right) C \tag{3.18}$$
$$C_{pdscXY} = \frac{1}{2}\left(\frac{K (W_{pA} + W_{pB})}{W_{min}} + K_1\right) C \tag{3.19}$$
The diffusion capacitance with no diffusion contact and with a shared contact between two
transistors is modeled identically, as there is no significant difference within the accuracy of
the model.
For the 2-input NAND gate, the parasitic node capacitances Cout and C1 are calculated as shown
below using the expressions derived above for Cprop and Cdiff.
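$$\begin{aligned} C_{out} &= \frac{1}{2}\frac{W_{nB}}{W_{min}} C + \frac{1}{2}\frac{W_{pA}}{W_{min}} C + \frac{1}{2}\frac{W_{pB}}{W_{min}} C + \frac{1}{2}\left(\frac{K\, W_{nB}}{W_{min}} + K_1\right) C + \frac{1}{2}\left(\frac{K (W_{pA} + W_{pB})}{W_{min}} + K_1\right) C \\ &= \frac{1}{2}\left(\frac{(1+K)(W_{nB} + W_{pA} + W_{pB})}{W_{min}} + 2K_1\right) C \end{aligned} \tag{3.20}$$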
 
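$$\begin{aligned} C_1 &= \frac{1}{2}\frac{W_{nA}}{W_{min}} C + \frac{1}{2}\frac{W_{nB}}{W_{min}} C + \frac{1}{2}\left(\frac{K (W_{nA} + W_{nB})}{W_{min}} + K_1\right) C \\ &= \frac{1}{2}\left(\frac{(1+K)(W_{nA} + W_{nB})}{W_{min}} + K_1\right) C \end{aligned} \tag{3.21}$$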
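Similarly, the channel resistances for the 2-input NAND gate follow from Eq. 3.7 and Eq. 3.8 and
are given below.
$$R_{nA} = \frac{W_{min}}{W_{nA}} R \qquad\qquad R_{nB} = \frac{W_{min}}{W_{nB}} R \tag{3.22}$$
$$R_{pA} = \frac{W_{min}}{W_{pA}} 2R \qquad\qquad R_{pB} = \frac{W_{min}}{W_{pB}} 2R \tag{3.23}$$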
By substituting the channel resistances and the parasitic node capacitances into the parasitic
delay equations for the 2-input NAND gate in Table 3.1, the path delays can be expressed explicitly
in terms of transistor widths. Hence the model allows fast and easy estimation of circuit delay and
can be used effectively for circuit speed and power optimization using transistor sizing.
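As an illustration of this substitution, the sketch below evaluates Eq. 3.20 to Eq. 3.23 for given transistor widths. The function name and return layout are assumptions made for the example; the defaults K = K1 = 1 are the values this chapter quotes for the 0.18 µm process, and both pFET resistances carry the factor of 2 from Eq. 3.8. Results are in units of C and R.

```python
def nand2_parasitics(w_nA, w_nB, w_pA, w_pB, w_min=1.0, K=1.0, K1=1.0):
    """Node capacitances (units of C) and channel resistances (units of R)
    of a 2-input NAND gate as functions of its four transistor widths."""
    # Eq. 3.20: output node Y sees nFET B and both pFETs.
    c_out = 0.5 * ((1.0 + K) * (w_nB + w_pA + w_pB) / w_min + 2.0 * K1)
    # Eq. 3.21: internal node between the two series nFETs.
    c_1 = 0.5 * ((1.0 + K) * (w_nA + w_nB) / w_min + K1)
    resistances = {
        "nA": w_min / w_nA,          # Eq. 3.22
        "nB": w_min / w_nB,          # Eq. 3.22
        "pA": 2.0 * w_min / w_pA,    # Eq. 3.23, pFET factor of 2 (Eq. 3.8)
        "pB": 2.0 * w_min / w_pB,    # Eq. 3.23, pFET factor of 2 (Eq. 3.8)
    }
    return c_out, c_1, resistances
```

With all four widths equal to Wmin this gives Cout = 4C and C1 = 2.5C, so even a minimum-size gate carries several unit capacitances of parasitic load.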
Similarly the input gate capacitance (Cin) and the interconnect wire capacitance (Cwire) can be 
modeled as shown below. 
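$$C_{in} = \left(\frac{W_{nX} + W_{pX}}{W_{min}}\right) C \tag{3.24}$$
Modern deep submicron processes have a wire capacitance of about 0.2 fF per micrometer of wire
length.
$$C_{wire} = 0.2\ \mathrm{fF/\mu m} \cdot L_{wire} \tag{3.25}$$
Normalizing to the unit capacitance
$$C_{wire} = \frac{0.2\ \mathrm{fF/\mu m} \cdot L_{wire}}{\left(\frac{C_{in}}{W}\right) W_{min}}\, C = \frac{0.2\ \mathrm{fF/\mu m} \cdot L_{wire}}{2\ \mathrm{fF/\mu m} \cdot W_{min}}\, C = \frac{1}{10} \frac{L_{wire}}{W_{min}}\, C \tag{3.26}$$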
In our analysis the wire resistance is not considered. Designers should add inverter repeaters to
make the wire resistance negligible compared to the transistor channel resistance. For a more
detailed discussion of how the expressions for the channel resistances and parasitic capacitances
are derived, refer to [28].
This delay model can be easily extended to larger circuits where multiple gates are connected 
together by including interconnect wire capacitance and the input gate capacitance. For example 
consider a 2x4 Decoder circuit (DEC2) with enable signal using 3-input NAND gates and 
inverters as shown in Figure 3.6. 
 
Figure 3.6: 2x4 Decoder circuit with enable signal 
It is assumed that each of the NAND gates is identical and each of the inverters is identical. 
Hence we only have to design one NAND gate and one inverter and use the same design for the 
others. 
 
 
Figure 3.7: a) 3-input NAND b) Inverter 
As for the 2-input NAND gate discussed earlier, determine the parasitic path delays to the output
from each input for the 3-input NAND gate and the inverter circuit shown in Figure 3.7. The
possible path parasitic delays for the 2x4 Decoder circuit shown in Figure 3.6 are then
$$t_{dP}(A_0Y_0r)_{DEC2} = t_{dP}(AYf)_{NAND3} + R_{out}(AYf)_{NAND3}\,[C_{wire} + C_{in}(AYr)_{INV}] + t_{dP}(AYr)_{INV} \tag{3.27}$$
$$t_{dP}(A_0Y_0f)_{DEC2} = t_{dP}(AYr)_{NAND3} + R_{out}(AYr)_{NAND3}\,[C_{wire} + C_{in}(AYf)_{INV}] + t_{dP}(AYf)_{INV} \tag{3.28}$$
$$t_{dP}(A_1Y_0r)_{DEC2} = t_{dP}(BYf)_{NAND3} + R_{out}(BYf)_{NAND3}\,[C_{wire} + C_{in}(AYr)_{INV}] + t_{dP}(AYr)_{INV} \tag{3.29}$$
$$t_{dP}(A_1Y_0f)_{DEC2} = t_{dP}(BYr)_{NAND3} + R_{out}(BYr)_{NAND3}\,[C_{wire} + C_{in}(AYf)_{INV}] + t_{dP}(AYf)_{INV} \tag{3.30}$$
$$t_{dP}(ENY_0r)_{DEC2} = t_{dP}(CYf)_{NAND3} + R_{out}(CYf)_{NAND3}\,[C_{wire} + C_{in}(AYr)_{INV}] + t_{dP}(AYr)_{INV} \tag{3.31}$$
$$t_{dP}(ENY_0f)_{DEC2} = t_{dP}(CYr)_{NAND3} + R_{out}(CYr)_{NAND3}\,[C_{wire} + C_{in}(AYf)_{INV}] + t_{dP}(AYf)_{INV} \tag{3.32}$$
To get the total path delay, add the corresponding effort delay Rout·Cload to each of the
expressions in Eq. 3.27 to Eq. 3.32.
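The pattern shared by Eq. 3.27 to Eq. 3.32 is the same for any two cascaded gates, and can be sketched as follows. The `Arc` record and the function names are hypothetical, introduced only for this example; delays are in units of RC, resistances in units of R and capacitances in units of C. The sketch assumes that the effort delay uses the output resistance of the last stage.

```python
from dataclasses import dataclass

@dataclass
class Arc:
    """One input-to-output timing arc of a gate (hypothetical container)."""
    t_dp: float    # parasitic delay of the arc
    r_out: float   # output resistance of the arc
    c_in: float    # input capacitance the arc presents to its driver

def wire_cap(l_wire_um, w_min_um):
    """Interconnect capacitance in units of C (Eq. 3.26)."""
    return 0.1 * l_wire_um / w_min_um

def two_stage_delay(first, second, c_wire, c_load):
    """Path delay through two cascaded gates.

    The parasitic part follows the pattern of Eq. 3.27 to Eq. 3.32; the
    effort delay R_out * C_load of the output stage is then added.
    """
    parasitic = first.t_dp + first.r_out * (c_wire + second.c_in) + second.t_dp
    return parasitic + second.r_out * c_load
```

For instance, `two_stage_delay(Arc(3.0, 1.0, 2.0), Arc(1.5, 0.5, 3.0), c_wire=1.0, c_load=4.0)` adds a parasitic part of 3 + 1·(1 + 3) + 1.5 = 8.5 to an effort delay of 0.5·4 = 2, giving 10.5 in units of RC. The numbers are arbitrary and only exercise the formula.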
Since the simplified RC delay model is explicitly expressed in terms of transistor channel widths, 
it helps in faster estimation of the circuit delay of any CMOS circuit and also helps analyze the 
effects of changing transistor channel widths on circuit delay easily. Unlike the logical effort 
model discussed in Chapter 2, this model allows p and n channel widths selection to be 
independent of each other in designing an effective circuit in terms of delay and power and is also 
more accurate than the logical effort model as it takes into consideration the interconnect wire 
capacitance which plays a significant role in determining the circuit delay when the wire lengths 
are long. 
 
 
3.2 Power Model 
Minimizing just the delay using the delay model discussed above would result in large transistors,
which consume more power and may not use the silicon area efficiently. Hence, there should be a
constraint on power when minimizing the delay. Before using power as a constraint to find the
optimum transistor sizes, we need a model for how the power changes with transistor widths [28].
Fortunately, almost all power consumption is proportional to the width of the transistor channels.
The power dissipative components in CMOS circuits consist of off-state leakage power, dynamic
power due to charging and discharging of node capacitances, short-circuit power, switching power
due to parasitic capacitances, and glitch power due to unequal arrival of signals. The major
components are the leakage power, the dynamic power and the short-circuit power. Let us look at
how these components are modeled as proportional to transistor channel widths.
Off-state leakage power 
When transistors are turned off, there can still be a small drain current (IDoff). Though this
current is too small to have a considerable impact on the delay, it can have a significant impact
on the power consumption.
$$P_{leak} = V_{dd} I_{Doff} = V_{dd} \left(\frac{I_{Doff}}{I_{Don}}\right) I_{Don} \tag{3.33}$$
where IDon is the drain current when the transistor is on. If we approximate IDon using the channel
resistance,
$$I_{Don} = \frac{V_{dd}}{R_{on}} = \frac{V_{dd}}{R} \frac{W}{W_{min}} \tag{3.34}$$
where R is the unit resistance, W is the channel width, and Wmin is the minimum transistor width.
Hence the leakage power is expressed in terms of transistor width as
$$P_{leak} = V_{dd} \left(\frac{I_{Doff}}{I_{Don}}\right) I_{Don} = \frac{V_{dd}^2}{R} \left(\frac{I_{Doff}}{I_{Don}}\right) \frac{W}{W_{min}} \tag{3.35}$$
Dynamic power 
The dynamic power consumption of a single logic gate (neglecting Cwire) is given as
$$P_{dyn\_gate} = (C_{out} + C_{load}) V_{dd}^2 f \tag{3.36}$$
where Cout comes from the parasitic delay and Cload from the effort delay. Since each transistor
appears as part of the parasitic delay in one logic gate and the effort delay in another logic
gate, each transistor contributes capacitance through both Cout and Cload. Therefore, the dynamic
power per transistor is approximately modeled as
$$P_{dyn\_tran} = \left[\left[(1 + K) \frac{W}{W_{min}} + K_1\right] C + \frac{W}{W_{min}} C\right] V_{dd}^2 f = [(2 + K) C]\, V_{dd}^2 f\, \frac{W}{W_{min}} + K_1 C V_{dd}^2 f \tag{3.37}$$
where C is the unit capacitance and f is the frequency of charging and discharging the node
capacitance.
Short circuit power 
Short circuit power is the only power dissipation component that cannot be modeled as 
proportional to transistor width. Fortunately, it is almost always negligible compared with other 
power dissipative components. Therefore, we do not need to include that in our optimization 
calculations. 
Since most of the power consumption per transistor is proportional to the transistor channel 
width, the total power consumption of a CMOS circuit can be approximated as proportional to the 
sum of all the transistor widths in the circuit.  
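The width dependence of Eq. 3.35 and Eq. 3.37 can be checked numerically with a short sketch. The function names and argument order are assumptions made for this example, and any numerical values passed in (supply voltage, off/on current ratio, frequency) would have to come from the process and design at hand; none are taken from this work except K = K1 = 1.

```python
def leakage_power(w, w_min, vdd, r_unit, ioff_over_ion):
    """Off-state leakage power of one transistor (Eq. 3.35)."""
    return (vdd ** 2 / r_unit) * ioff_over_ion * (w / w_min)

def dynamic_power(w, w_min, vdd, c_unit, freq, K=1.0, K1=1.0):
    """Dynamic power attributed to one transistor (Eq. 3.37)."""
    return ((2.0 + K) * (w / w_min) + K1) * c_unit * vdd ** 2 * freq

def power_objective(widths):
    """Power proxy used for optimization: the sum of all transistor widths."""
    return sum(widths)
```

Doubling a transistor width doubles its leakage power exactly, while its dynamic power grows slightly less than twofold because of the width-independent K1 term; for optimization purposes both are treated as proportional to width, which is what `power_objective` expresses.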
 
3.3 Optimization Methodology 
To optimize the circuit effectively for speed and power by determining the optimum transistor
channel widths, a heuristic approach based on the gradient descent method is used. As discussed in
Chapter 2, gradient descent, also known as steepest descent, is a first order optimization
algorithm that finds a local minimum by taking steps along the negative of the gradient of the
objective function (OF) at the current point in the decision variable (DV) space.
$$X_{i+1} = X_i - \delta \frac{\nabla OF}{|\nabla OF|} \tag{3.38}$$
where δ is the step size. Because the gradient is normalized, δ is the length of each step in the
DV space.
In our optimization analysis, each transistor width (Wi) is bounded between a minimum (Wmin) 
and a maximum (Wmax) to avoid very large transistor widths and the decision variable space is 
normalized so that each transistor width after normalization varies from 0 to 1. 
$$W_{i,norm} = \frac{W_i - W_{min}}{W_{max} - W_{min}} \tag{3.39}$$
When minimizing power, the OF is considered as the sum of the transistor widths in the circuit as
discussed in the power model.
$$OF = \sum_i W_i \tag{3.40}$$
When minimizing delay, the OF is the maximum of all possible critical path delays.
$$OF = \max\{path\_delays\} \tag{3.41}$$
In this case, the minimum of the OF lies at the intersection of two path delay curves, so the OF
surface has a V-shaped minimum as shown in the figure below.
Figure 3.8: OF surface when minimizing delay 
Since the minimum of the OF is V-shaped, it is very difficult to find a point at which the gradient
is zero or very small to determine convergence. Hence the optimization algorithm is considered
converged when the step size falls below a small value ε, usually on the order of 10^-5, or when
the gradient of the OF w.r.t. the decision variables is less than or equal to ε. The heuristic
approach used to achieve faster and effective convergence is described in the steps below.
1. Choose an initial step size δ = δint and start the optimization using the gradient descent
method.
2. Track the best OF point found in the DV space at each iteration. If the algorithm does not find
a better OF value within n iterations of the current iteration, restart the optimization from
the best point found so far with a reduced step size, i.e. δ = δ/2.
3. To speed up convergence, when δ is less than δ1 but greater than δ2, i.e. δint > δ1 > δ > δ2,
reduce the number of iterations allowed for finding a better OF value from n to n1. When δ is
less than δ2, i.e. δ1 > δ2 > δ > ε, further reduce this number from n1 to n2.
To significantly reduce the convergence time without affecting the optimization results much,
choose n1 = n/2 and n2 = n/10; δ1 and δ2 can be chosen depending on the value of δint. For example,
when n = 50 then n1 = 25 and n2 = 5, and when δint = 0.5 then δ1 = 0.1 and δ2 = 0.01.
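The three steps above can be sketched as follows. This is an illustrative reading of the heuristic, not the code used to produce the results in this work: the function names, the central-difference gradient, the clipping of the normalized widths to the range 0 to 1, and the `max_iter` safeguard are all assumptions made for the example.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference gradient (assumed; any gradient routine could be used)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return g

def heuristic_descent(f, x0, delta_int=0.5, delta1=0.1, delta2=0.01,
                      n=50, eps=1e-5, max_iter=20000):
    """Normalized gradient descent with best-point tracking and step halving."""
    x = np.clip(np.asarray(x0, dtype=float), 0.0, 1.0)
    best_x, best_f = x.copy(), f(x)
    delta, stall = delta_int, 0
    for _ in range(max_iter):
        if delta <= eps:                       # converged: step size below eps
            break
        g = numerical_gradient(f, x)
        g_norm = np.linalg.norm(g)
        if g_norm <= eps:                      # converged: gradient below eps
            break
        x = np.clip(x - delta * g / g_norm, 0.0, 1.0)   # Eq. 3.38, widths kept in [0, 1]
        fx = f(x)
        if fx < best_f:                        # step 2: track the best point
            best_x, best_f, stall = x.copy(), fx, 0
        else:
            stall += 1
        # Step 3: allow fewer non-improving iterations as the step size shrinks.
        if delta < delta2:
            patience = max(1, n // 10)
        elif delta < delta1:
            patience = max(1, n // 2)
        else:
            patience = n
        if stall >= patience:                  # step 2: restart from best point, halve delta
            x, delta, stall = best_x.copy(), delta / 2.0, 0
    return best_x, best_f
```

A small V-shaped test function such as `max(1 - w[0], 2 * w[0]) + (w[1] - 0.5) ** 2`, whose minimum value is 2/3 at w = (1/3, 1/2), is a convenient way to try the sketch: the gradient never vanishes at the minimum, so termination relies on the shrinking step size.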
The circuit is optimized either to minimize the circuit delay under a power constraint or to
minimize the circuit power under a delay constraint. The optimization algorithm is defined as
follows. When minimizing the delay with a power constraint, if the power constraint is met at the
current point in the DV space then the OF is the circuit delay; otherwise the OF is the circuit
power. Similarly, when minimizing the power with a delay constraint, if the delay constraint is met
at the current point in the DV space then the OF is the circuit power; otherwise the OF is the
circuit delay.
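The switching rule described above can be written as a small wrapper around a delay function and a power function. The names `constrained_objective`, `delay_fn`, `power_fn` and `limit` are hypothetical and used only for this sketch; the delay and power functions themselves would come from the simplified RC delay model and the power model.

```python
def constrained_objective(delay_fn, power_fn, limit, minimize="delay"):
    """Build an OF that switches between delay and power at the constraint.

    minimize="delay": `limit` is the power constraint.
    minimize="power": `limit` is the delay constraint.
    """
    def objective(widths):
        delay, power = delay_fn(widths), power_fn(widths)
        if minimize == "delay":
            # Constraint met: work on delay. Constraint violated: reduce power.
            return delay if power <= limit else power
        # Constraint met: work on power. Constraint violated: reduce delay.
        return power if delay <= limit else delay
    return objective
```

Whenever the constraint is violated, the gradient of the returned OF points back toward the feasible region, so the same descent loop handles both the objective and the constraint.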
This heuristic approach optimizes the circuit much faster than the conventional methods discussed
in [11] and [12] while converging to a satisfactory solution for the optimum transistor widths,
even for circuit optimization involving multiple delay paths and hundreds of transistor widths.
3.4 Model Validation 
To validate the simplified RC delay model, a simple chain of four inverters is considered. The
model is tested against SPICE and Logical Effort (LE) both for a lightly loaded circuit, i.e. with
no interconnect wire capacitance Cwire and a small load capacitance Cload, and for a heavily loaded
circuit, i.e. with large Cwire and Cload.
The analysis is done for a 0.18 micrometer process technology. For this process the values of K and
K1 in equations (3.14) - (3.19) are both equal to one, i.e. K = K1 = 1. The model is equivalent to
the logical effort method when K1 = 0 and Cwire is ignored.
3.4.1 Lightly loaded case 
Consider the circuit of four inverter chain as shown in the Figure 3.9 with no interconnect wire 
capacitance and a small load capacitance on the output node. 
Figure 3.9: Four inverter chain with a light load and no Cwire. C=0.89fF for 0.18µm process 
It is assumed that the transistor widths of the first inverter gate are kept fixed: the
p-transistor width is fixed at 1.414Wmin and the n-transistor width at Wmin. For the remaining
inverter gates all the transistor widths are initially set to Wmin, where Wmin = 0.36 micrometers
for a 0.18 micrometer process. The circuit is then optimized with different power constraints using
the simplified RC delay model and the results are shown in the plot below.

Figure 3.10: Power-delay plot for lightly loaded case (delay in psec versus power = Wsum in
micrometers; curves: SPICE_Model, Model, SPICE_LE, LE)
From the above plot, in the lightly loaded case the model results for the optimum transistor widths
agree well with the SPICE results in the region with tight power constraints, but agree less well
when the power constraint is loose. The plot also shows that the simplified RC delay model predicts
delays much more accurately than the logical effort method.
The above four inverter chain can be solved in closed form for minimum delay to determine the
optimum transistor sizes, as discussed in [28], and the optimum value of the transistor size ratio
is found to be $Z_i = \sqrt{2}$, where $Z_i = W_{pi}/W_{ni}$ and i is the index of the inverter in
the inverter chain. The optimum widths found by the simplified RC delay model for minimum circuit
delay in the lightly loaded case also agree with the closed form solution, i.e. $Z_i = \sqrt{2}$.
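The value √2 can be motivated by a short calculation. What follows is a simplified sketch and not necessarily the derivation given in [28]. It assumes that a rising and a falling transition contribute equally to the path delay (they alternate along an inverter chain), that the load C_L driven by the stage is fixed, and that the total width of the stage, W_T = W_n + W_p, is held fixed. The channel resistances are taken from Eq. 3.7 and Eq. 3.8, and Z = W_p/W_n.

$$
\begin{aligned}
t_{rise} + t_{fall} &\propto (R_p + R_n)\, C_L = R\, W_{min} \left(\frac{2}{W_p} + \frac{1}{W_n}\right) C_L \\
&= \frac{R\, W_{min}\, C_L}{W_T}\, (1 + Z)\left(\frac{2}{Z} + 1\right) = \frac{R\, W_{min}\, C_L}{W_T} \left(3 + Z + \frac{2}{Z}\right) \\
\frac{d}{dZ}\left(3 + Z + \frac{2}{Z}\right) &= 1 - \frac{2}{Z^2} = 0 \quad\Rightarrow\quad Z = \sqrt{2}
\end{aligned}
$$

The factor of 2 under the square root is the pFET resistance factor of Eq. 3.8, which is also why the first inverter in the validation circuit has its p-transistor width fixed at 1.414Wmin.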
3.4.2 Heavily loaded case 
Consider the four-inverter chain shown in Figure 3.11, with large interconnect wire capacitance
and a large load capacitance on the output node.
 
 
Figure 3.11: Four inverter chain with a heavy load and large Cwire. C=0.89fF for 0.18µm process
As in the lightly loaded case, the transistor widths of the first inverter gate are kept fixed: the
p-transistor width at 1.414Wmin and the n-transistor width at Wmin. For the remaining inverter
gates, all the transistor widths are initially set to Wmin, where Wmin = 0.36 micrometers for a
0.18 micrometer process. The circuit is then optimized under different power constraints using
the simplified RC delay model, and the results are shown in the plot below.
 
Figure 3.12: Power-delay plot for heavily loaded case
[Plot: delay (psec) versus power = Wsum (micrometers); curves: SPICE_Model, Model, SPICE_LE, LE]
[Capacitance labels from the preceding circuit figure: 10C, 10C, 10C, 10C]
In the heavily loaded case, the optimum transistor widths found by the model agree well with the
SPICE results. Again, the simplified RC delay model predicts delay much more accurately than the
logical effort method.
The SPICE results above were obtained by simulating the inverter chain in SPICE with the optimum
transistor widths found by the model and by the logical effort method, in order to confirm that
the widths found actually reduce the SPICE delay.
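To illustrate why wire capacitance changes the optimum sizes, the sketch below sizes an inverter
chain with a reduced RC delay expression. It is a simplified stand-in, not the model or the
optimizer used in this work: each inverter is described by a single size s (in multiples of a
minimum inverter) instead of separate p and n widths, the stage delay is assumed to be
(p*s_i + s_(i+1) + cwire_i)/s_i in units of R*C, there is no power constraint, and the
minimization uses plain coordinate descent. All names and numerical values are hypothetical.

```python
def chain_delay(sizes, wire_caps, load, parasitic=1.0):
    """Sum of normalized stage delays (units of R*C) for an inverter chain."""
    total = 0.0
    for i, s in enumerate(sizes):
        next_cap = sizes[i + 1] if i + 1 < len(sizes) else load
        total += (parasitic * s + next_cap + wire_caps[i]) / s
    return total

def size_chain(n_stages, wire_caps, load, s_max=1000.0, sweeps=200):
    """Coordinate descent on the stage sizes; the first stage stays at size 1."""
    sizes = [1.0] * n_stages
    for _ in range(sweeps):
        for i in range(1, n_stages):
            next_cap = sizes[i + 1] if i + 1 < n_stages else load
            # With the other sizes held fixed, d(delay)/d(s_i) = 0 gives
            # s_i = sqrt(s_(i-1) * (next_cap + cwire_i)); clip to [1, s_max].
            best = (sizes[i - 1] * (next_cap + wire_caps[i])) ** 0.5
            sizes[i] = min(max(best, 1.0), s_max)
    return sizes

light = size_chain(4, wire_caps=[0, 0, 0, 0], load=16.0)
heavy = size_chain(4, wire_caps=[10, 10, 10, 10], load=16.0)
print([round(s, 3) for s in light])
print([round(s, 3) for s in heavy])
```

With no wire capacitance and a load of 16, the sizes settle to the geometric progression 1, 2,
4, 8. With wire capacitance on every node the later stages grow beyond that progression, an
effect that a sizing method which ignores Cwire cannot capture.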
A comparison of the CPU computation time taken to estimate the delay of cascaded inverter gates
is shown in Table 3.3.
              CPU Time (seconds)
Circuit       SPICE     IRSIM     Model
INV           0.022     0.001     < 10^-3
INV2          0.022     0.001     < 10^-3
INV4          0.024     0.001     < 10^-3
INV8          0.029     0.001     < 10^-3
INV16         0.102     0.001     < 10^-3
INV32         0.188     0.002     < 10^-3
INV64         0.411     0.003     < 10^-3
INV128        0.953     0.005     < 10^-3
INV256        1.582     0.008     < 10^-3
Table 3.3: CPU time comparison
In the above table, the number following the term INV is the number of cascaded inverter gates.
As Table 3.3 shows, SPICE takes a significant amount of computation time even for circuits with
just a couple of hundred transistors, whereas IRSIM and the model take far less computation time
than SPICE. Overall, the model performed well in quickly estimating the delay of the circuit.
3.5 Convexity of the objective function 
Reference [29] states that a real function f is convex on an interval [a, b] if, for any two
points x1 and x2 in [a, b] and any λ where 0 < λ < 1,
$f[\lambda x_1 + (1 - \lambda) x_2] \le \lambda f(x_1) + (1 - \lambda) f(x_2)$          (3.42)
In our optimization using transistor sizing, each transistor width is bounded between a minimum
and a maximum, i.e. 𝑊𝑖 ∈ [𝑊𝑚𝑖𝑛,𝑊𝑚𝑎𝑥]. The parasitic channel resistance, the parasitic channel and
diffusion capacitances, and the wire capacitance terms discussed in Section II are all convex on
the interval [𝑊𝑚𝑖𝑛,𝑊𝑚𝑎𝑥], as they all obey the above condition for a convex function. It is also
observed that the product of the channel resistance and the parasitic capacitance is convex on
the interval [𝑊𝑚𝑖𝑛,𝑊𝑚𝑎𝑥]. From the properties of convex functions, the sum of two convex functions
is also convex. Hence the path delay expressions, which are just sums of these convex functions,
are also convex over the interval [𝑊𝑚𝑖𝑛,𝑊𝑚𝑎𝑥].
When minimizing the delay, the objective function is the maximum of the possible critical path
delays. Again, from the properties of convex functions, the maximum of two convex functions is
also convex. Since each path delay is convex on the interval [𝑊𝑚𝑖𝑛,𝑊𝑚𝑎𝑥], the maximum of the path
delays is also convex on the interval [𝑊𝑚𝑖𝑛,𝑊𝑚𝑎𝑥].
When minimizing the power, the objective function is the sum of the transistor widths, which is
linear in the widths and therefore also convex on the interval [𝑊𝑚𝑖𝑛,𝑊𝑚𝑎𝑥] by the definition of
a convex function.
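The condition (3.42) can be spot-checked numerically for functions of a single width. The
sketch below samples random pairs of points and values of λ and reports whether the inequality
ever fails. The 1/W resistance and W capacitance forms, the value of W_MAX, the two example path
expressions and their coefficients are illustrative assumptions rather than the expressions of
this work, and a sampled check of this kind can only find counterexamples; it does not prove
convexity.

```python
import math
import random

W_MIN, W_MAX = 0.36, 3.6   # micrometers; W_MAX is an assumed upper bound

def satisfies_convexity(f, a, b, trials=20000, tol=1e-9, seed=0):
    """Sample inequality (3.42) on [a, b]; return False on the first violation."""
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2, lam = rng.uniform(a, b), rng.uniform(a, b), rng.random()
        lhs = f(lam * x1 + (1.0 - lam) * x2)
        rhs = lam * f(x1) + (1.0 - lam) * f(x2)
        if lhs > rhs + tol:
            return False
    return True

def resistance(w):
    """Normalized channel resistance, assumed proportional to 1/W."""
    return W_MIN / w

def capacitance(w):
    """Normalized gate or diffusion capacitance, assumed proportional to W."""
    return w / W_MIN

def path_a(w):
    # The gate of width w drives a fixed load of 4 units and loads its driver.
    return 4.0 * resistance(w) + capacitance(w)

def path_b(w):
    return 10.0 * resistance(w) + 0.5 * capacitance(w)

def worst_path(w):
    """Delay objective: the maximum of the candidate path delays."""
    return max(path_a(w), path_b(w))

for name, f in [("resistance", resistance), ("path_a", path_a),
                ("worst_path", worst_path), ("sqrt (concave)", math.sqrt)]:
    print(name, satisfies_convexity(f, W_MIN, W_MAX))
```

The square root is included only as a control: it is concave, so the check is expected to
report a violation for it.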
 
 
CHAPTER IV 
 
 
PERFORMANCE EVALUATION & OPTIMIZATION RESULTS 
4.1 Parallel prefix adders 
Addition forms the basis for many processing operations. As a result, adder circuits are of great
interest to digital system designers. Many adder architectures serve different speed and area
requirements. The focus of this dissertation is to model the popular high-speed parallel prefix
adders, such as Brent-Kung, Skylansky and Kogge-Stone, using the simplified RC delay model and
to compare their speed and power by optimizing under different performance constraints.
These adders perform the addition operation based on carry generation and propagation logic.
The expressions describing whether a group spanning bits i…j, inclusive, generates or propagates
a carry are
 $G_{i:j} = G_{i:k} + P_{i:k} \cdot G_{k-1:j}$             (4.1)
 $P_{i:j} = P_{i:k} \cdot P_{k-1:j}$              (4.2)
with the base case
 $G_{i:i} = A_i \cdot B_i$              (4.3)
 $P_{i:i} = P_i = A_i \oplus B_i$             (4.4)
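The recurrences (4.1) to (4.4) can be exercised in software. The sketch below is a behavioral
illustration only and says nothing about gate-level design: it combines (G, P) pairs over spans
that double at each level, in the manner of a Kogge-Stone network, and checks the result against
integer addition. Placing the carry-in at position 0 with G = Cin and P = 0 is a common textbook
convention adopted here as an assumption, and the function names are hypothetical.

```python
def combine(hi, lo):
    """Merge an upper group (i:k) with the adjacent lower group (k-1:j).

    Implements (4.1) and (4.2): G = G_hi + P_hi * G_lo, P = P_hi * P_lo.
    """
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def prefix_add(a, b, n, cin=0):
    """Add two n-bit integers with a parallel-prefix carry computation."""
    # Position 0 holds the carry-in; positions 1..n hold the bitwise (G, P)
    # pairs of operand bits 0..n-1, as in (4.3) and (4.4).
    gp = [(cin, 0)]
    for i in range(n):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        gp.append((a_i & b_i, a_i ^ b_i))
    bit_p = [p for _, p in gp]

    dist = 1
    while dist <= n:
        # After this level, gp[i] covers 2*dist positions ending at i
        # (or everything down to position 0 if fewer are available).
        gp = [combine(gp[i], gp[i - dist]) if i >= dist else gp[i]
              for i in range(n + 1)]
        dist *= 2

    # gp[i][0] is now G_{i:0}, the carry out of position i.
    total = 0
    for pos in range(1, n + 1):
        total |= (bit_p[pos] ^ gp[pos - 1][0]) << (pos - 1)
    return total, gp[n][0]

s, cout = prefix_add(0b1011, 0b0110, 4)
print(bin(s), cout)
```

The sum bit at each position is the bitwise propagate signal XORed with the group generate of
all lower positions, and the carry-out is the group generate of the full span.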
 
 
For an N-bit adder, the Brent-Kung adder computes the carry generate and propagate prefixes
for 2-bit groups. These are used to find prefixes for 4-bit groups, which in turn are used to find
prefixes for 8-bit groups, and so forth. The prefixes then fan back down to compute the carries-in
to each bit. The adder requires 2(log2N) - 1 stages, and the fanout is limited to 2 at each stage.
The Skylansky adder reduces the delay to log2N stages by computing intermediate prefixes along
with the large group prefixes. This comes at the expense of fanouts that double at each level.
These high fanouts cause poor performance on wide adders unless the gates are appropriately sized.
The Kogge-Stone adder achieves both log2N stages and a fanout of 2 at each stage. This comes at
the cost of many long wires that must be routed between stages. The adder also contains more PG
cells; while this may not impact the area if the adder layout is on a regular grid, it will increase
the power consumption. Despite these costs, the Kogge-Stone adder is widely used in
high-performance 32-bit and 64-bit adders. This dissertation presents a power-delay performance
comparison of these adders for 32-bit, 64-bit and 128-bit adder sizes. For simplicity, 16-bit
adder structures are presented in Figure 4.1, Figure 4.2 and Figure 4.3 to discuss the design
methodology used for the 32-bit, 64-bit and 128-bit adders. The figures use separate symbols for
the following cells: the bitwise PG cell; the gray cell designed as an AOI gate; the black cell
designed as a combination of an AOI gate and a NAND gate; the gray cell designed as an OAI gate;
the black cell designed as a combination of an OAI gate and a NOR gate; an inverter gate; a pair
of inverter gates; and the sum bit generating gate. [30] discusses the design of the bitwise PG
cell, gray cell, black cell and sum bit cells in detail. It is important to note that the AOI gate
takes in un-inverted inputs and generates an inverted output. Similarly, the OAI gate takes in
inverted inputs and generates an un-inverted output. Hence inverter gates are used as needed to
provide the right input signal to the corresponding gates.
Also, in order to simplify the adder design and analysis problem, gates that drive a similar load
are assumed to be identical in size. Since many gates drive similar loads, each stage of the adder
structure contains groups of identical gates.
Figure 4.1: 16-bit Brent-Kung adder
Input bit positions: 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 Cin
Outputs: Cout S15 S14 S13 S12 S11 S10 S9 S8 S7 S6 S5 S4 S3 S2 S1 S0
Identical gate groups in each stage:
Stage1: {Cin(inv1)} – {1,3,5,7,9,11,13,15,16} – {2,6,8,14} – {4,10,12}
Stage2: {1} – {2,6,8,14(inv2)} – {3,7,11,15} – {5,13} – {9}
Stage3: {1(inv1)} – {2} – {3} – {5,13(inv2)} – {7,15} – {11}
Stage4: {4} – {5} – {7} – {11(inv2)} – {15}
Stage5: {4,5,7(inv1)} – {6,8} – {9} – {11} – {15}
Stage6: {10,12} – {13}
Stage7: {10,12,13(inv1)} – {14}
Stage8: {S0 to S15} – {Cout_AOI} – {Cout_inv1}

Figure 4.2: 16-bit Skylansky adder
Input bit positions: 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 Cin
Outputs: Cout S15 S14 S13 S12 S11 S10 S9 S8 S7 S6 S5 S4 S3 S2 S1 S0
Identical gate groups in each stage:
Stage1: {Cin(inv1)} – {1,3,5,7,9,11,13,15,16} – {2,6,8,10,14} – {4,12}
Stage2: {1} – {2,6,8,10,14(inv2)} – {3,7,11,15} – {5,13} – {9}
Stage3: {1(inv1)} – {2} – {3} – {5,13(inv2)} – {6,7,14,15} – {10} – {11}
Stage4: {4,5,6} – {7} – {10,11(inv2)} – {12,13,14,15}
Stage5: {4,5,6,7(inv1)} – {8,9,10,11,12,13,14} – {15}
Stage6: {S0 to S15} – {Cout_AOI} – {Cout_inv1}

Figure 4.3: 16-bit Kogge-Stone adder
Input bit positions: 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 Cin
Outputs: Cout S15 S14 S13 S12 S11 S10 S9 S8 S7 S6 S5 S4 S3 S2 S1 S0
Identical gate groups in each stage:
Stage1: {Cin(inv1)} – {1,2,3,4,5,6,7,8,9,10,11,12,13,14} – {15} – {16}
Stage2: {Cin(inv1)} – {1} – {2,3,4,5,6,7,8,9,10,11,12,13} – {14,15}
Stage3: {Cin,1(inv1)} – {2,3} – {4,5,6,7,8,9,10,11} – {12,13,14,15}
Stage4: {Cin,1,2,3(inv1)} – {4,5,6,7} – {8,9,10,11,12,13,14,15}
Stage5: {Cin,1,2,3,4,5,6,7(inv1)} – {8,9,10,11,12,13,14} – {15}
Stage6: {S0 to S15} – {Cout_AOI} – {Cout_inv1}

 
In the above figures of the adder structures, each group of identical gates is enclosed in curly
braces ‘{ }’, and each logic gate in a stage is identified by its input bit position for quick
reference to the identical groups.
Path delay expressions are determined using the simplified RC delay model, as discussed earlier,
for the paths needed to define all the possible critical path delays and to include all of the
circuit’s non-identical transistor widths in the decision variable space. The wire capacitance, as
modeled in (3.26), is determined by the length of the interconnect wire. The wire capacitance of
short wires is assumed to be zero; only the capacitances of significantly long wires are included
as non-zero capacitances in calculating the circuit delay.
 
4.2 Optimization Results 
The circuit analysis is carried out for the TSMC 0.18 micrometer process technology, i.e. K = 1,
R = 5 kΩ, C = 0.89 fF and Cload = 2C. Using the optimization methodology discussed in Chapter III,
the three adder structures are optimized to determine the optimum transistor widths while meeting
the performance constraints. The results are shown in Figure 4.4, Figure 4.5 and Figure 4.6 for
the 32-bit, 64-bit and 128-bit adder sizes respectively. During the optimization, all the
transistor widths in the first stage of the adders are kept fixed at the minimum transistor
width. The first data point on the left side of each power-delay curve is the minimum delay
point found with no power constraint. The last data point on the right side of each curve is the
point at which minimum transistor widths are used for all the circuit transistors. The remaining
data points were determined by minimizing power under different delay constraints. The small
solid circle on each curve marks the minimum power-delay product data point.
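Locating the minimum power-delay product point on such a curve is a simple search over the
optimized data points. The sketch below shows the idea on made-up (delay, power) pairs; the
values are hypothetical and are not read from Figures 4.4 to 4.6, and the function names are
illustrative.

```python
def min_power_delay_point(points):
    """Return the (delay, power) pair with the smallest delay * power product."""
    return min(points, key=lambda dp: dp[0] * dp[1])

def fastest_point(points):
    """Return the (delay, power) pair with the smallest delay."""
    return min(points, key=lambda dp: dp[0])

# Hypothetical curve: delay in psec, power as the sum of widths in micrometers.
curve = [(900.0, 2600.0), (1000.0, 1500.0), (1200.0, 900.0),
         (1600.0, 700.0), (2300.0, 600.0)]

print("fastest point:", fastest_point(curve))
print("minimum power-delay product point:", min_power_delay_point(curve))
```

On a curve of this shape the minimum-product point typically lies between the two end points,
since the leftmost point spends a large amount of power for a small gain in speed.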
 
Figure 4.4: Power-delay comparison plot for 32-bit adder case
[Plot: power = ∑W (micrometers) versus delay (psec); curves: Brent-Kung, Skylansky, Kogge-Stone]
For the 32-bit adder, after circuit optimization the Skylansky adder is the fastest of the three
adders, followed by Kogge-Stone and then Brent-Kung. However, Brent-Kung has the minimum
power-delay product, followed by Skylansky and then Kogge-Stone.
 
Figure 4.5: Power-delay comparison plot for 64-bit adder case
[Plot: power = ∑W (micrometers) versus delay (psec); curves: Brent-Kung, Skylansky, Kogge-Stone]
For the 64-bit adder, after circuit optimization the Kogge-Stone adder is the fastest of the
three adders, but its power consumption is excessive. It is followed by Skylansky and then
Brent-Kung in terms of speed. Brent-Kung still has the minimum power-delay product, followed by
Skylansky and then Kogge-Stone.
 
Figure 4.6: Power-delay comparison plot for 128-bit adder case
[Plot: power = ∑W (micrometers) versus delay (psec); curves: Brent-Kung, Skylansky, Kogge-Stone]
For the 128-bit adder, after circuit optimization the Kogge-Stone adder is still the fastest of
the three adders, but again with excessive power consumption. Kogge-Stone is now followed by
Brent-Kung and then Skylansky in terms of speed. Brent-Kung still has the minimum power-delay
product, followed by Skylansky and then Kogge-Stone.
From the above three plots it can be observed that, when power consumption is not a concern,
Kogge-Stone is the fastest for the 64-bit and 128-bit sizes, while Skylansky is the fastest for
the 32-bit size. The Brent-Kung adder has the minimum power-delay product when optimum transistor
widths are used in all three adder sizes. Appendices A, B and C give more details on the optimum
transistor widths found and show histograms of the transistor widths for the three adders under
various constraints.
 
CHAPTER V 
 
 
CONCLUSION 
The proposed simplified RC delay model can quickly estimate the circuit delay of a complex
CMOS circuit. It is more accurate and allows more efficient circuit design than the existing
logical effort method, in that it includes the effects of interconnect wire capacitance in
determining the circuit delay and determines the p and n transistor widths independently. It is
much easier to use numerical optimization techniques with our simplified RC delay equations than
with SPICE, for the following reasons:
1. The delays are explicit functions of the transistor widths, making it obvious what happens
when any width is changed.
2. The width dependence of the delays is smooth, so that simple numerical methods can be
used to quickly find the optimum transistor widths.
Using the model and the heuristic gradient-based optimization technique discussed in Chapter III,
the three popular parallel prefix adders were optimized and their power-delay performance was
compared without having to do tedious layout and simulation. The results suggest that the
Brent-Kung adder has the smallest power-delay product when optimum transistor widths are used,
followed by the Skylansky adder and then the Kogge-Stone adder. While Kogge-Stone is the fastest
for the 64-bit and 128-bit adder sizes, this comes at the expense of excessive power consumption.
 
REFERENCES 
 
 
[1] L. W. Nagel, and D. O. Pederson, “SPICE (Simulation Program with Integrated Circuit 
Emphasis)”, Memorandum No. ERL-M382, University of California, Berkeley, Apr. 
1973. 
[2] R. P. Brent and H. T. Kung, “A Regular Layout for Parallel Adders”, IEEE Transactions on 
Computers, vol-31, no.3, pp. 260-264, Mar 1982. 
[3] J. Skylansky, “Conditional-Sum Addition Logic”, IRE Transactions, EC-9, pp. 226-231, 
Jun. 1960. 
[4] P. M. Kogge and H. S. Stone, “A Parallel Algorithm for the Efficient Solution of a General 
Class of Recurrence Equations”, IEEE Transactions on Computers, vol. 22, no. 8, pp. 
786-792, Aug 1973. 
[5] W. C. Elmore, “The transient response of damped linear networks with particular regard to 
wideband amplifiers”, Journal of Applied Physics, vol. 19, pp. 55-63, Jan 1948. 
[6] W. Shockley, “A unipolar field effect transistor”, Proc. IRE, vol. 40, pp. 1365-1376, Nov. 
1952. 
[7] T. Sakurai and A. R. Newton, “Alpha-Power Law MOSFET Model and its 
Applications to CMOS Inverter Delay and Other Formulas”, IEEE Journal of Solid-State 
Circuits, vol. 25, no. 2, April 1990. 
[8] A. Salz and M. Horowitz, “IRSIM: An Incremental MOS Switch-Level Simulator”, Proc. 
26th Design Automation Conference, pp. 173-178, June 1989. 
 
[9] J. Chang, “A Piecewise Linear Delay Modeling of CMOS circuits”, Ph.D. 
dissertation, ECEN, OSU, Stillwater, OK, 2006. 
[10] I. Sutherland, B. Sproull and D. Harris, “Logical Effort: Designing Fast CMOS Circuits”, 
CA: Morgan Kaufmann Publishers, 1999, pp. 1-83 
[11] A. Cauchy, “Méthode générale pour la résolution des systèmes d’équations simultanées”, 
Comp. Rend. Sci. Paris, 25, pp. 46-89, 1847. 
[12] J. Barzilai and J. M. Borwein, “Two point step size gradient methods”, IMA Journal of 
Numerical Analysis, 8, pp. 141-148, 1988. 
[13] Jiren Yuan, Christer Svensson, “CMOS Circuit Speed Optimization Based on Switch Level 
Simulation”, Circuits and Systems, IEEE International Symposium, vol.3, pp. 2109 – 
2112, 1988. 
[14] Sachin S. Sapatnekar, Vasant B. Rao, Pravin M. Vaidya, Sung-Mo Kang, “An Exact 
Solution to the Transistor Sizing Problem for CMOS Circuits Using Convex 
Optimization”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and 
Systems, vol.12, no.11, pp. 1621-1634, November 1993.  
[15] Masaaki Yamada, Sachiko Kurosawa, Reiko Nojima, Naohito Kojima, “Synergistic 
Power/Area Optimization with Transistor Sizing and Wire Length Minimization”, IEEE 
Symposium on Low Power Electronics, pp. 50-51, 1994. 
[16] Jiren Yuan, Christer Svensson, “Principle of CMOS Circuit Power-Delay Optimization 
with Transistor Sizing”, IEEE International Symposium on Circuits and Systems, vol.1, 
pp.637-640, 1996. 
[17] A. Wroblewski, O. Schumacher, C. V. Schimpfle, J. A. Nossek, “Minimizing Gate 
Capacitance with Transistor Sizing”, IEEE International Symposium on Circuits and 
Systems, vol. 4, pp.186-189, 2001. 
 
[18] H.Y. Chen, S.M Kang, “A New Circuit Optimization Technique for High Performance 
CMOS Circuits”, IEEE Transactions on Computer-Aided Design, vol.10, no.5, May 
1991. 
[19] Manjit Borah, Robert Michael Owens, Mary Jane Irwin, “Transistor Sizing for Low Power 
CMOS Circuits”, IEEE Transactions on Computer-Aided Design of Integrated Circuits 
and Systems, vol.15, no.6, June 1996. 
[20] Robert Rogenmoser, Hubert Kaeslin, “The Impact of Transistor Sizing on Power Efficiency 
in Submicron CMOS Circuits”, IEEE Journal of Solid-State Circuits, vol.32, no.7, July 
1997.  
[21] Maitham Shams, Mohamed I. Elmasry, “Delay Optimization of CMOS Logic Circuits 
using Closed-Form Expressions”, International Conference on Computer Design, pp. 
563-568, 1999. 
[22] Bradley S. Carlson, Suh-Juch Lee, “Delay Optimization of Digital CMOS VLSI Circuits by 
Transistor Reordering”, IEEE Transactions on Computer-Aided Design of Integrated 
Circuits and Systems, vol.14, no.10, October 1995. 
[23] Li Ding, Pinaki Mazumder, “Optimal Transistor Tapering for High-Speed CMOS Circuits”, 
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 
pp. 708-713, 2002. 
[24] A. Baliga, D. Yagain, “Design of High speed adders using CMOS and Transmission gates 
in Submicron Technology: A Comparative Study”, IEEE Fourth International Conference 
on Emerging Trends in Engineering and Technology, pp.284-289, 2011. 
[25] K. Nehru, A. Shanmugam, S. Vadivel, “Design of 64-Bit Low Power Parallel Prefix VLSI 
Adder for High Speed Arithmetic Circuits”, IEEE International Conference on 
Computing, Communication and Applications, pp.1-4, 2012. 
 
[26] A. N. Jayanthi, C. S. Ravichandran, “Comparison of Performance of High Speed VLSI 
Adders”, IEEE International Conference on Current Trends in Engineering and 
Technology, pp.99-104, 2013. 
[27] N. Poornima, V. S. Kanchana Bhaaskaran, “Power-Delay Optimized 32 Bit Radix-4, 
Sparse-4 Prefix Adder”, IEEE Fifth International Conference on Signal and Image 
Processing, pp.201-205, 2014. 
[28] L. G. Johnson, “Advanced Digital VLSI Design”, Lecture notes for ECEN 6263, OSU, 
Stillwater, OK, 2010. Available http://lgjohn.okstate.edu/6263/index.html 
[29] W. Rudin, “Principles of Mathematical Analysis”, p. 101, 1976 
[30] N. H. E. Weste and D. M. Harris, “CMOS VLSI DESIGN: A Circuits and Systems 
Perspective”, 4th ed., ch. 11, pp. 429-461, 2011. 
 
 
APPENDICES 
 
APPENDIX A 
Histograms for 32-bit adder size for minimum delay case, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]
Minimum delay: Brent-Kung 217.756*∆t; Skylansky 209.595*∆t; Kogge-Stone 214.401*∆t

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung        759      396      117      105       84       19        9        7        2        6        6
Skylansky         548      733      223      154       49       61        4        6        1        3        0
Kogge-Stone       498      430      192      393      211      109       83       93      135       56        0
 
Histograms for 32-bit adder size with delay constraint of tmax=250*∆t, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       1148      257       90        5        6        4        0        0        0        0        0
Skylansky        1154      543       77        4        3        1        0        0        0        0        0
Kogge-Stone       998     1022      168       12        0        0        0        0        0        0        0
 
Histograms for 32-bit adder size with delay constraint of tmax=300*∆t, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       1273      225        8        4        0        0        0        0        0        0        0
Skylansky        1377      394        9        2        0        0        0        0        0        0        0
Kogge-Stone      1475      717        8        0        0        0        0        0        0        0        0
 
Histograms for 32-bit adder size with delay constraint of tmax=350*∆t, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       1320      184        6        0        0        0        0        0        0        0        0
Skylansky        1579      199        4        0        0        0        0        0        0        0        0
Kogge-Stone      1770      430        0        0        0        0        0        0        0        0        0
 
Histograms for 32-bit adder size for minimum power-delay product
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       1237      246       14        8        5        0        0        0        0        0        0
Skylansky         962      723       89        2        5        0        1        0        0        0        0
Kogge-Stone      1344      784       56       16        0        0        0        0        0        0        0
 
 
 
APPENDIX B 
Histograms for 64-bit adder size for minimum delay case, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]
Minimum delay: Brent-Kung 269.982*∆t; Skylansky 245.804*∆t; Kogge-Stone 239.215*∆t

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       1214     1294      184      298       19       30        8        7        3        5        2
Skylansky        1385     1489      329      150      336       98       32       32        3        9        5
Kogge-Stone      1058      924     1019      627      250      288      259      108      201      280        0
 
 
Histograms for 64-bit adder size with delay constraint of tmax=300*∆t, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       1687     1085      248       21       13        4        3        1        1        1        0
Skylansky        2356     1358      135        8        8        1        0        0        2        0        0
Kogge-Stone      2238     2092      532      120       16       16        0        0        0        0        0
 
 
Histograms for 64-bit adder size with delay constraint of tmax=350*∆t, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       2308      699       43       10        2        2        0        0        0        0        0
Skylansky        2906      937       12       10        1        1        1        0        0        0        0
Kogge-Stone      2629     2289       64       32        0        0        0        0        0        0        0
 
 
Histograms for 64-bit adder size with delay constraint of tmax=400*∆t, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       2393      651       15        3        2        0        0        0        0        0        0
Skylansky        3244      608       11        3        1        1        0        0        0        0        0
Kogge-Stone      3050     1916       48        0        0        0        0        0        0        0        0
 
Histograms for 64-bit adder size for minimum power-delay product
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       1746     1030      236       28       12        4        3        2        2        1        0
Skylansky        2112     1446      275       15       10        6        2        0        0        2        0
Kogge-Stone      3179     1723       80       32        0        0        0        0        0        0        0
 
APPENDIX C 
Histograms for 128-bit adder size for minimum delay case, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]
Minimum delay: Brent-Kung 332.852*∆t; Skylansky 336.602*∆t; Kogge-Stone 294.256*∆t

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       2684     2525      500      381       49       32        6        5        2        9        1
Skylansky        4032     2570      980      608      246       54        9        4        2        7        0
Kogge-Stone      2104     2003     1972     1696      523      550      756      388      376      856       64
 
Histograms for 128-bit adder size with delay constraint of tmax=400*∆t, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       3595     2327      244       10        7        3        6        2        0        0        0
Skylansky        5230     2998      254       18        5        0        0        2        4        1        0
Kogge-Stone      4758     4862     1172      432       32       32        0        0        0        0        0
 
Histograms for 128-bit adder size with delay constraint of tmax=550*∆t, where ∆t=R*C
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       4700     1474       11        8        1        0        0        0        0        0        0
Skylansky        6548     1944       12        1        6        1        0        0        0        0        0
Kogge-Stone      8063     3017      112       96        0        0        0        0        0        0        0
 
Histograms for 128-bit adder size for minimum power-delay product
[Histograms: No. of transistors versus W/Wmin for each adder; series: total transistors, No. of n
transistors, No. of p transistors. The table lists the data labels on the bars, in bin order.]

W/Wmin              1   1 to 2   2 to 3   3 to 4   4 to 5   5 to 6   6 to 7   7 to 8   8 to 9  9 to 10       10
Brent-Kung       3483     2062      544       52       31        6        4        2        3        7        0
Skylansky        4908     3106      454       21       14        2        0        0        2        5        0
Kogge-Stone      5741     4419      952      112       64        0        0        0        0        0        0

 VITA 
 
Sunil Kumar Lakkakula 
 
Candidate for the Degree of 
 
Doctor of Philosophy 
 
Thesis:    CMOS CIRCUIT SPEED AND POWER OPTIMIZATION USING 
SIMPLIFIED RC DELAY MODEL 
 
 
Major Field:  Electrical Engineering 
 
Biographical: 
 
Education: 
 
Completed the requirements for the Doctor of Philosophy in Electrical 
Engineering at Oklahoma State University, Stillwater, Oklahoma in May, 2015. 
 
Completed the requirements for the Master of Science in Electrical Engineering 
at Oklahoma State University, Stillwater, Oklahoma in December, 2009. 
  
Completed the requirements for the Bachelor of Technology in Electrical and 
Electronics Engineering at Acharya Nagarjuna University, Guntur, Andhra 
Pradesh, India in April, 2007. 
 
Experience:   
 
Graduate Research Associate, Oklahoma State University. 
 
Professional Memberships:   
 
Member, International Society of Automation 
Member, Golden Key 
 
 
 
 
 
