Efficient arithmetic using self-timing by Lu, Shih-Lien et al.
AN ABSTRACT OF THE THESIS OF
Ravichandran Ramachandran for the degree of Master of Science in
Electrical and Computer Engineering presented on September 1, 1994.
Title: Efficient Arithmetic Using Self-timing
Abstract approved:
Shih-Lien Lu
The recent advances in VLSI technology have facilitated feature shrinking
and hence a rapid increase in the levels of integration at the chip level. This increase
in the level of integration has brought along with it a host of other constraints, the
most crucial being timing management and increased power dissipation. Such
constraints potentially prevent the full exploitation of the increased processing power
made possible by technological advances.
Timing in complex digital systems has traditionally been managed by using
a global clock, under whose control all the actions take place in lock-step. An alternative
means of managing timing, called self-timing, simplifies the problems of timing
management and results in reduced power dissipation in complex digital systems. Systems
designed using this self-timed or asynchronous protocol work on a principle of
handshaking, running at their own speed, governed by local timers and the availability
of data on which to work. However, this handshaking introduces an overhead both
in terms of hardware and computational speed.
The work presented here examines the implementation of an adder, called
a Parallel Half-Adder (PHA), which gains its speed by exploiting the power of
asynchrony to calculate the sum. The adder has been implemented in the form of a tunable
micropipeline and compared to traditional adders in terms of hardware complexity
and speed. Comparable results have been obtained, implying that the overhead due
to handshaking is justified and the performance improvements due to self-timing
can be fully exploited. The design of an array divider using the PHA has also been
presented.
Efficient Arithmetic Using Self-timing
by
Ravichandran Ramachandran
A THESIS
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Master of Science
Completed September 2, 1994
Commencement June, 1995
APPROVED:
Assistant Professor of Electrical and Computer Engineering in charge of major
Head of Department of Electrical and Computer Engineering
Dean of Graduate School
Date thesis is presented: September 2, 1994
Typed by Ravichandran Ramachandran for : Ravichandran Ramachandran
ACKNOWLEDGEMENT
This work is dedicated to my parents and grandparents, but for whose
understanding and help it would never have been possible. I extend gratitude to Ramya
Ramachandran, my loving sister, for providing unconditional support throughout.
Special thanks to Dr. Shih-Lien Lu, my major professor, for his inspiring
guidance during my stay here at Oregon State University. I owe most of the skills
that I picked up as a graduate student to him. Thanks to Dr. Bella Bose, my minor
professor, for being available always, for discussions, and for his interest in my project.
Dr. Satish Reddy, the Graduate Council Representative, and Dr. Vijay Tripathi deserve
special thanks for consenting to be a part of the graduate committee and for making
time for this defense amidst their tight schedules. Thanks to Ms. Rita Wells for
helping me expedite most of the paper work during my stay here at OSU.
My research mate Chih-Ming Chang deserves a special mention for his
insightful comments on the project. Thanks to Lalit Merani, an ex-member of this research
group, and the other members of this group for their comments, and to Shankar Pennathur
for patiently reading this document and providing valuable comments.
My very special friends Karthik Ramamurthy, Sridhar Kotikalapoodi, Sudha
Subbaraman and Jayanthi Chandramouli deserve special mention for listening to my
tales of woe in times of stress and distress.
Thanks to the rest of the gang comprising Vasudev Tanikella, Pattamata Srinivas,
Ashuthosh Kale, Manoj Kumar, Praveen Manapragada, Satish Kulkarni and Shivani
Gupta for making the stay here at Corvallis memorable.
This research is funded by a National Science Foundation Grant,
#MIP-9211510.
ECE Support did a wonderful job responding to my weird queries and to
my complaints about the computer system here.
TABLE OF CONTENTS
CHAPTER 1. INTRODUCTION
1.1 Motivation
1.1.1 The Power Factor
1.1.2 Physical Constraints
1.1.3 Alleviation of the Ground Bounce Problem
1.1.4 Scalability
1.1.5 Miscellaneous Advantages
1.2 Disadvantages
1.2.1 The handshaking overhead
1.2.2 Race Conditions
1.2.3 Meta-stability
1.3 Organization of this document
CHAPTER 2. ASYNCHRONOUS SYSTEM IMPLEMENTATION
2.1 Introduction
2.2 Signalling schemes
2.2.1 The 4-cycle request-acknowledge protocol
2.2.2 Two Phase or Transition signalling
2.2.3 Comparison of the two schemes
2.3 Event Control Primitives
2.4 Micropipelines
2.5 Completion Signal Generation
2.5.1 DCVSL (Differential Cascode Voltage Switch Logic)
2.5.2 ECDL (Enable/Disable CMOS Differential Logic)
CHAPTER 3. ARCHITECTURE OF THE PARALLEL HALF-ADDER
3.1 Addition Schemes
3.1.1 Ripple Carry Addition
3.1.2 Conditional Sum Adder
3.1.3 Carry-Completion Sensing Adders
3.1.4 Carry Lookahead adders
3.2 The Parallel Half-adder
3.3 The Proposed Architecture
3.4 Including Subtraction
3.5 Overflow detection
3.6 Hardware consumption estimation
3.7 Computational speed of the PHA
CHAPTER 4. THE PHA: A PERFORMANCE EVALUATION
4.1 An Efficiency model
4.2 Investment on adders
4.2.1 The Area time model
4.2.2 The log of Gate count model
4.2.3 The Area-time-squared Model
4.3 Determination of Hardware complexity and computation time
4.4 Efficiency calculations
4.5 Limitations of the efficiency model
4.6 Improving the Efficiency of the PHA
4.6.1 The Tandem Adder
4.6.2 Hiding the Zero-Detect delay
CHAPTER 5. IMPLEMENTATION OF THE PHA
5.1 Implementation Strategies
5.2 Partitioning the adder
5.3 The Final Implementation
5.4 Circuit diagrams
5.5 Simulation Results
CHAPTER 6. A SELF-TIMED 16 BIT DIVIDER USING THE PHA
6.1 Array Division
6.2 The design of a 16-Bit Divider
CHAPTER 7. CONCLUSIONS AND FUTURE WORK
6.3 Conclusions
6.4 Future Work
BIBLIOGRAPHY
LIST OF FIGURES
Figure 2.1. Two blocks communicating through handshaking
Figure 2.2. Detailed Request and acknowledge signal propagation
Figure 2.3. Timing Diagram for 4 cycle handshake
Figure 2.4. Transition Signalling
Figure 2.5. A Muller-C element
Figure 2.6. A Micropipeline using transition signalling
Figure 2.7. DCVSL Gate Structure
Figure 2.8. Structure of an ECDL gate
Figure 3.1. Carry Ripple Adder
Figure 3.2. A 2 level 16-bit CLA adder
Figure 3.3. Carry Propagation for two 4-bit numbers
Figure 3.4. C program that simulates the PHA [20]
Figure 3.5. Software Simulation of the PHA: Fast Convergence
Figure 3.6. Software simulation of the PHA: Worst Case convergence
Figure 3.7. The architecture of a PHA
Figure 3.8. Subtraction of 65532 from 23456
Figure 4.1. An Adder in a pipeline
Figure 4.2. Computation time and gate count of adders
Figure 4.3. Efficiency estimation of adders
Figure 4.4. Alternative addition of two numbers
Figure 4.5. Tandem addition
Figure 4.6. Modified Adder to support "predict incomplete"
Figure 5.1. Partitioning the PHA into cells
Figure 5.2. The PHA captured using VIEWlogic®
Figure 5.3. The insides of the Controlled Register
Figure 5.4. The computation plane
Figure 5.5. CONTROL unit for the PHA
Figure 5.6. Including Subtraction: Carry Select
Figure 5.7. Simulation of the PHA using VIEWsim
Figure 5.8. Simulation of the PHA (Worst Case)
Figure 5.9. Subtraction using the PHA
Figure 5.10. SPICE simulation of the CONTROL circuit
Figure 5.11. SPICE simulation of the bitslice
Figure 6.1. A 16 bit array Divider
Figure 6.2. Design capture of the 16-bit divider
LIST OF TABLES
Table 3.1: Hardware consumption estimation of an n-bit PHA
Table 4.1: Gate counts and speeds of different adders
Table 4.2: Gate count and computational speeds of the CLA and the PHA
CHAPTER 1. INTRODUCTION
The size and complexity of digital systems have increased tremendously in the last
quarter century. This increase in complexity has been supported in part by the advances in
VLSI technology. Increased processing power is a direct consequence of scaling down fea-
ture sizes in the IC process. However, this level of integration has also brought along with
it a host of other constraints that potentially prevent the exploitation of this increased proces-
sing power. One of the key constraints is managing timing. Timing management has now
become formidable, if not insurmountable. A ubiquitous paradigm in recent times has been
the presence of a global clock which introduces a sense of discreteness into time. It is this
master clock in "synchronism" with which all the actions take place in lockstep within the
system. In recent years clock speed has been used as a universal distinguishing factor for
seemingly identical systems. It has become a signifier of computing machismo.
It has been pointed out that the global clocking scheme has been pushed to its
limit [1]. This has necessitated alternative methodologies for managing timing in
digital systems. One direct offshoot of this search is the advent of the "asynchronous protocol,"
which dictates that the various blocks of a system maintain a sense of timing with respect to the
blocks to which they are connected. Variously called self-timed, asynchronous, or
speed-independent, systems designed with this protocol run at their own speed, governed by
local timers and the availability of data on which to work. This in turn introduces a sense
of locality in timing management that can be exploited to overcome some of the constraints
imposed by global clocking.
This work examines the design and implementation of an asynchronous array
divider. The underlying philosophy of this work is to provide an example of mapping an
algorithm which is inherently asynchronous in nature to a system which implements a globally
asynchronous design philosophy, and to analyze its performance.
1.1 Motivation
"We might say that the clock enables us to introduce a discreteness into time,
so that time for some purposes can be regarded as a succession of instants
instead of a continuous flow. A digital machine must essentially deal with
discrete objects, and in case of the ACE (automatic computing engine) that
is made possible by the use of a clock. All other digital computing machines
except for humans and other brains that I know of do the same. One can think
up ways of avoiding it, but they are very awkward."
Alan Turing, 1947
Lecture to the London
Mathematical Society
The idea of using a clock in a system is as old as the first computer made. Some
of the important factors that have led to the extensive use of clocks in digital systems are:
• The clock gained a symbolic importance. Seemingly identical systems
  could be compared based on their clock speeds.
• A lock-step execution allows for a certain amount of discreteness
  in the system, and tracing through the operation is remarkably
  easy.
• It is hard to even find a metaphor on the basis of which to imagine
  an asynchronous system. This has led VLSI designers to use the
  clocked logic framework, which is easier to design with and to
  understand.
Going asynchronous, then, means unlearning some concepts and relearning a few
others. Asynchronous system implementation is not new. Digital communication systems,
which are characterized by communication channel delays that are many orders of magnitude greater
than the internal clock period of the chip, use some ingenious ways to maintain the correct
sequence of operation.
The prime factors that make this relearning worthwhile are enumerated below:
1.1.1 The Power Factor
As an example, an asynchronous Personal Computer's (PC) Central Processing
Unit (CPU) running under maximum load will consume as much power as a synchronous
CPU, but when partially loaded it expends only as much power as its active units require [2].
It is a known fact that most PC CPUs are rarely loaded to their maximum capacity. This saving
is made possible as only some of the units, say, the multiplier unit or the divider unit, need to
be active most of the time. Any unit not used for the execution of a particular instruction does
not consume any power. At the same time, when needed it can become active with minimum setup
time. Generalizing this example to any system, only the active components of the system
will consume power. This implies that the worst case power drawn will be equal to that of the
synchronous case and in all other cases the power consumption is lower, assuming that the power
consumed in the handshaking interface is negligible.
The power consumption of a CMOS system depends on its total parasitic
capacitance [3]. In a globally clocked system, the clock signal should be capable of driving the load
contributed by all the memory elements in the system. This is ensured by employing a
tapered buffer to distribute the clock signal. A tapered buffer consists of a string of inverters,
each varying in size from the immediate predecessor in the string by a constant magnification
factor [4]. As an example, if a system comprising 10000 memory elements and a tapered clock
buffer possessing a magnification factor of m is considered, then the effective parasitic load
after buffering would be:
10000 in the last buffer stage
10000/m in the preceding stage
10000/m^2 in the stage before that
and so on. Totalling all these parasitic loads,
$$C_{total} = 10000 + \frac{10000}{m} + \frac{10000}{m^2} + \cdots \qquad (1)$$
Thus for a magnification factor of 4, which is typical for most clock buffers, C_total would
be approximately 13300, which represents a 33% increase over the original value. Since
asynchronous systems use local clock generation strategies, such clock buffers are rarely
employed. Hence the overall capacitance of such a system is potentially lower. This
translates directly into reduced power dissipation.
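The load estimate above is a geometric series, so it can be checked numerically. The following is a minimal sketch in Python; the function name `tapered_buffer_load` and its parameters are illustrative and do not come from the thesis, and loads are expressed in units of one memory element's load.

```python
def tapered_buffer_load(base_load, m, stages=None):
    """Total parasitic load seen with a tapered clock buffer.

    base_load: load of the memory elements driven by the last stage.
    m:         magnification factor between successive buffer stages.
    stages:    number of terms to sum; None uses the closed form of the
               infinite geometric series (an idealization).
    """
    if m <= 1:
        raise ValueError("magnification factor must be greater than 1")
    if stages is None:
        # base_load * (1 + 1/m + 1/m^2 + ...) = base_load * m / (m - 1)
        return base_load * m / (m - 1)
    return sum(base_load / m ** k for k in range(stages))


total = tapered_buffer_load(10000, 4)
overhead = total / 10000 - 1.0  # fractional increase over the unbuffered load
print(round(total), round(overhead * 100))
```

For m = 4 the closed form gives roughly 13333 unit loads, in line with the approximate figure of 13300 and the 33% increase quoted in the text.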
1.1.2 Physical Constraints
Clock skew is another important consideration. The very fact that the clock is glob-
al imposes constraints like clock skew, which is the phase difference of the global synchro-
nization signal at different locations in the system. This problem has been aggravated by
increasing the level of integration in the chip. The same global signal now needs to connect
to a lot more components, thus increasing the load on the signal line and hence increasing
the possibility of skewing. As the various interconnected blocks in an asynchronous system
communicate with each other by means of completion and other synchronization signals that
are locally generated, this problem of clock skew can be virtually avoided.
The life of a chip fabricated using asynchronous timing can exceed the life of its
synchronous counterpart. An argument in favour would be this simple example: An increase
in clock frequency would mean reduction in rise and fall times of the clock. Reduced rise
and fall times mean sharper transient currents and hence an increased problem of electro-
migration. This can potentially cause the lines carrying the clock signals to snap over time.
Generalizing this argument, since various blocks in an asynchronous system start working
only when they have valid data available to them, and they generate control signals only on
events like completion on the interface, these transients are greatly reduced. Consequently
the chip life may be increased.
1.1.3 Alleviation of the Ground Bounce Problem
Ground bounce [4] is the term applied to describe the excursion of the ground lines
from their normal zero potential in a large system. Broadly described as noise, ground
bounce can cause spurious switching of gates and hence incorrect operation. This problem
is noticed when a large number of loads are switched simultaneously. In a self-timed system,
only the parts of the system that are necessary for processing are active. Thus, on the average,
the total number of loads switching is tremendously reduced in comparison to a clock driven
implementation where loads always switch in response to a clock pulse. This reduction in
the number of loads actively switching simultaneously, can substantially reduce the problem
of ground bounce.
1.1.4 Scalability
There is a possibility that a system might need some extended functionality
somewhere down its life cycle. In such cases, it is easy to improve an asynchronous system. In
particular, pipelined systems can benefit from this. Since no global constraints exist, a new
block can be added as long as it follows the handshaking convention. Synchronous systems,
on the other hand, resist such midlife improvement cycles. Radical redesigning of sections
may be required to incorporate the newly required functionality in a synchronous system.
1.1.5 Miscellaneous Advantages
Another, more esoteric, advantage is that better and faster circuits can be designed by
VLSI designers. As an example, the speed of an adder circuit depends on the number of
carries that need to be propagated for different operands. In the case of a synchronous system
the worst case is taken into account in the design process, as everything should keep in step
with the clock. The completion signal generation in the self-timed system can determine
when carry propagation is complete, and thus the process can potentially be sped up.
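To make the operand dependence concrete, the sketch below measures how far a carry actually travels when two operands are added bit by bit. It is an illustrative Python model, not the adder studied in this thesis; the function name `longest_carry_chain` and the 16-bit default width are assumptions made for the example.

```python
def longest_carry_chain(a, b, width=16):
    """Length of the longest run of bit positions a single carry travels
    through when adding a and b (carry-in assumed to be 0)."""
    longest = run = 0
    carry = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        propagate = x ^ y
        carry_out = (x & y) | (carry & propagate)
        if carry_out:
            # A propagated carry extends the current chain;
            # a freshly generated carry starts a new chain of length 1.
            run = run + 1 if propagate else 1
        else:
            run = 0
        longest = max(longest, run)
        carry = carry_out
    return longest


print(longest_carry_chain(0xFFFF, 0x0001))  # worst case: carry crosses every bit
print(longest_carry_chain(0x5555, 0xAAAA))  # no carry is ever generated
```

A clocked adder has to budget for the first case on every addition, whereas a self-timed adder with completion detection only needs to wait as long as the operands at hand require.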
A subtle advantage of using this philosophy is that it promotes modular design.
This will simplify laying out the design at the chip level. A modular design yields very
well to silicon implementation. This modularity will aid the macro-cell design methodology
adopted by ASIC CAD tools. This would also mean that incremental performance gains
are easier to implement. Any element or block in the critical path of a system can be replaced
by an improved version without disturbing any other parts of the system.
1.2 Disadvantages
1.2.1 The handshaking overhead
One directly observable disadvantage of such asynchronous handshaking is the
overhead involved in the handshaking hardware itself. Since explicit synchronizations are
essential at a number of points in the system, the hardware costs associated with these could
be tremendous. Another potential disadvantage could be a reduction in the throughput, since
a finite amount of time is involved in ensuring synchronization. Both of these problems can
be offset if systems having large computation threads between synchronization points are
considered. Such systems would reduce the synchronization time and hardware costs to a
small fraction of the overall computation time and hardware cost. This is a trade off which
will pay off if the system is partitioned intelligently.
1.2.2 Race Conditions
A change in signal values on more than one line in a combinational circuit can cause
temporary errors in output values. In a feedback-free system this erroneous operation is
transient, and the output reaches a stable state after a finite amount of time determined by
the delay paths in the combinational network. In cases where these delays are bounded, these
temporary errors can be eliminated by including all the prime implicants in the network and
ensuring that only one input changes at a time [5]. However, in sequential networks these can
cause steady state errors. In bounded delay cases these can be eliminated by introducing
redundant states and making the feedback loop delay long enough to ensure proper operation.
This poses a problem in asynchronous sequential circuits with unbounded gate
delays. The use of data detectors and spacers [6] has been reported, but this decreases the hardware
efficiency of the system.
1.2.3 Meta-stability
Any cross-coupled gate pair potentially exhibits meta-stability. In synchronous
systems, the clock rates are so tailored that the clock signal always trails the data. Another
place where meta-stability is noticed is when mutual exclusion for the access of a resource
is to be absolutely guaranteed. This sort of mutual exclusion is usually handled by an
arbiter. Since both synchronous and asynchronous systems carry such mutual exclusion
elements, the problem of meta-stability exists in both. Keeping the data and the clock in
the correct relation to each other is easier to ensure and visualize in a synchronous system.
However, this problem is brought to the fore in asynchronous systems as there is
no master driver like a clock with respect to which meta-stability is studied. Hence the
problem is one of visualization. While it is true that an asynchronous system will have more
synchronization points and mutual exclusion elements than its synchronous counterpart, this
just increases the possibility of arbiter failures and does not introduce a new problem. Hence
the problem is present in both types of systems, but asynchronicity possibly increases the
likelihood of its occurrence. If the data generates the synchronization signal, then there
would be no synchronization signal until the data is valid. This is a possible solution to the
meta-stability problem and needs further scrutiny. Another solution to this problem is the
use of Q Flops [7].
Weighing the advantages and disadvantages, it seems quite likely that asynchronous
systems could perform better than their synchronous counterparts in the following
cases. The list is indicative of the types of possible applications but is by no means
exhaustive.
• Cases where power dissipation is an important consideration.
• Cases where the computation algorithm is in itself asynchronous in
  nature. This work is one such example.
• Cases where the clock becomes the limiting factor in the performance
  of the system.
• Cases where the computation threads are large in nature and hence
  would justify the overhead of asynchronous system implementation.
Thus asynchronous systems offer an alternative paradigm of timing management
for complex digital systems. Weighing the advantages presented above, it appears that this
paradigm warrants further consideration. This document will study one such asynchronous
system, an asynchronous parallel half-adder.
1.3 Organization of this document
In Chapter 2, we present general asynchronous system design methodologies and
choose a particular implementation philosophy for the work presented here. In Chapter 3,
we explain a new addition scheme which exploits the input bit patterns to converge to a sum
rapidly and the corresponding architecture necessary to implement the adder in hardware.
In Chapter 4, we characterize the adder, compare it with some known implementations of
adders and discuss methods to improve it. Chapter 5 presents the implementation strategies
to implement the adder in silicon and presents the simulation results of addition using the
adder. Chapter 6 describes the application of the adder to a non-restoring array divider. In
Chapter 7 we present conclusions and indicate directions for future work.
CHAPTER 2. ASYNCHRONOUS SYSTEM IMPLEMENTATION
Timing management in an asynchronous system depends wholly on the generation
of local handshaking signals. The interconnect between two blocks of a system
synchronizes these signals and thus the control flow is effected. Two popular protocols governing
the interconnect behavior exist. These are the transition signalling scheme and the
four-phase scheme. The transition signalling scheme is also called the two-phase or
non-return-to-zero scheme. The four-phase scheme is also called the Muller signalling or
return-to-zero scheme. This chapter discusses these protocols and touches on their advantages and
disadvantages in general. A few important methods to generate completion signals to flag
actions in an asynchronous system are also discussed.
2.1 Introduction
A high level view of two interconnected blocks and the handshaking signals that are
effectively needed to control their operation is shown in Fig. 2.1.
Figure 2.1. Two blocks communicating through handshaking (Block A and Block B; signals: Request, Acknowledge, Data)
Block A places data on the data line and sends a request signal to block B. Block B uses the
data after receiving the request signal. After the data is no longer needed in the data line,
Block B produces an acknowledge signal, indicating to A that it can change data at the data
line. This kind of handshaking scheme implies that A can pass on data to B but not
vice versa. A more complete representation of the handshaking scheme, shown in Fig. 2.2, allows
two-way communication between the blocks. The signals Rin, Rout, Ain and Aout indicate the
handshaking signals.
Figure 2.2. Detailed Request and acknowledge signal propagation (Block A, Interconnect, Block B; labels: Data_in, Data_out, Rout, Aout, Ain)
This request-acknowledge model is basic to all asynchronous systems. The order and
sense of the request and acknowledge signals distinguish different systems.
2.2 Signalling schemes
2.2.1 The 4-cycle request-acknowledge protocol
The simplest interconnection circuit is one that follows the four-phase handshaking
protocol [8]. Assume that the signals Rin, Rout, Ain and Aout are all low initially (Rin-, Rout-, Ain-,
Aout-). Block A, after its computation, raises Rin (Rin+), indicating the availability of data
on its data line. Signal Ain is initially low, indicating that Block B is ready to accept the data
placed on the data line by Block A. The handshake circuit raises Aout (Aout+), indicating
to Block A that its output datum has been accepted and that Rin can be reset (Rin-). The
handshaking interface then raises Rout (Rout+), indicating to Block B to begin computation. Block
B finishes its computation and this information is fed back using the Ain signal (Ain+). This
resets Rout (Rout-), which in turn resets Ain (Ain-). This completes the four-phase handshaking
scheme. Figure 2.3 depicts the timing diagram for a system following a four-phase
handshaking scheme.
Figure 2.3. Timing Diagram for 4 cycle handshake (traces: Data, Rin, Ain, Aout, Rout)
Computation in a four-phase scheme is always initiated on either rising-edge or falling-edge
transitions, and all the signals return to '0' after the computation is complete. Thus the
4-cycle protocol is also called the return-to-zero (RZ) signalling scheme.
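The four-phase sequence described in this section can be traced with a short behavioural sketch. This is an illustrative Python model, not a circuit description; the function name is hypothetical, and the point at which Aout falls is an assumption (the text only states that all signals return to '0').

```python
def four_phase_cycle():
    """Trace one return-to-zero handshake between Block A and Block B."""
    signals = {"Rin": 0, "Aout": 0, "Rout": 0, "Ain": 0}
    trace = []

    def drive(name, value):
        signals[name] = value
        trace.append(name + ("+" if value else "-"))

    drive("Rin", 1)            # Block A: data is valid on its data line
    if signals["Ain"] != 0:    # Ain low means Block B is ready
        raise RuntimeError("Block B is not ready")
    drive("Aout", 1)           # interconnect: datum accepted
    drive("Rin", 0)            # Block A resets its request
    drive("Aout", 0)           # ASSUMPTION: Aout falls once Rin has fallen
    drive("Rout", 1)           # interconnect: Block B may start computing
    drive("Ain", 1)            # Block B: computation finished
    drive("Rout", 0)           # request to Block B is reset
    drive("Ain", 0)            # acknowledge from Block B is reset
    return trace, signals


trace, final = four_phase_cycle()
print(" ".join(trace))
```

Each of the four signals makes two transitions per transfer, eight in all, and every signal ends the cycle at '0', which is what gives the scheme its return-to-zero name.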
This type of signalling is most compatible with logic families like ECDL [9] and
DCVSL [10]. Since these logic families are differential in nature, the imbalance in the output
signals when the computation is complete can be used for completion signal generation.
The signals thus generated are inherently four-phase since these logic families alternate
between enabled and disabled states.
2.2.2 Two Phase or Transition signalling
Two phase or transition signalling does not depend on the sense of the signals, in that
it does not matter if the signal is at a logic high or low. All signalling is handled as events
and the interconnect deals with transitions rather than logic levels. For an explanation of this
scheme proposed by Ivan Sutherland [1], refer to Fig. 2.2. Initially it is assumed that all the
request and acknowledge signals are at a logic '0'. Block A after its computation raises the
signal Rin. The interconnect checks Ain (Ain is a logic '0' in this case) to ascertain if Block
B is ready to accept data. This causes the signal Aout to be raised, acknowledging the
acceptance of data. Rout is raised, initiating computation in Block B. Block B completes its
computation and eventually produces a completion signal. This signal is fed back in the form
of Ain. This cycle is repeated with the control signals just toggling in state for the next cycle.
It should be noted that the interconnect described above is the non-pipelined interconnect
scheme. Figure 2.4 depicts the timing diagram for a transition signalling scheme.
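The same transfer under transition signalling can be sketched by treating every signal change as an event, regardless of its direction. The Python model below is illustrative only; the function name is hypothetical, and the readiness test (Rout and Ain holding equal values) is an assumption that generalizes the "Ain is a logic '0'" check of the first cycle to later cycles.

```python
def two_phase_cycles(n_cycles):
    """Trace n_cycles transfers using transition (two-phase) signalling."""
    signals = {"Rin": 0, "Aout": 0, "Rout": 0, "Ain": 0}
    trace = []

    def toggle(name):
        signals[name] ^= 1                 # any transition counts as an event
        trace.append((name, signals[name]))

    for _ in range(n_cycles):
        toggle("Rin")                      # Block A: data available
        # ASSUMPTION: Block B is ready when its last request was acknowledged
        if signals["Rout"] != signals["Ain"]:
            raise RuntimeError("Block B is still busy")
        toggle("Aout")                     # acceptance acknowledged to Block A
        toggle("Rout")                     # Block B told to start
        toggle("Ain")                      # Block B: completion fed back
    return trace, signals


trace, final = two_phase_cycles(2)
print(len(trace), final)
```

Each transfer takes four transitions, and the signals only return to their initial levels after two transfers, since the second cycle simply toggles them back.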
2.2.3 Comparison of the two schemes
One direct advantage of a two-phase implementation over its four-phase counterpart
is that it involves fewer transitions at the handshaking interface. This implies that
the power dissipation could be reduced. Another advantage is that the two-cycle
implementation has been shown to be faster [11]. However, all the logic families known produce
only four-phase completion signals. Hence using a standard logic family like ECDL or
DCVSL would mean that the completion signals produced by them must be converted
to two-phase. This is a hardware overhead which might be justified in large systems where
the cost of such hardware is extremely small compared to the system's overall hardware
complexity.
Figure 2.4. Transition Signalling (timing diagram spanning one cycle and the next cycle)
On the other hand, four-phase schemes initiate computation only on the rising edge
or the falling edge. Hence, the signals need to be reset once every cycle. Some time overhead
could be encountered in resetting these control signals. This increases the total number of
transitions in the handshaking interface and hence increases the power dissipation. The
advantage, however, is that all logic is four-phase and hence the return-to-zero signalling
scheme is directly compatible with logic families. It is also easier to intuitively understand
a four-phase scheme as it is closely linked to logic levels. Another advantage of using
four-cycle signalling is that simple edge-triggered flip-flops and registers can be used for storing
data, unlike in the two-cycle case where special types of registers capable of latching on both
edges of a control pulse must be designed.
2.3 Event Control Primitives
In order to handle two-phase control signals as described above, the interconnect
needs some primitives capable of handling events. The most important primitives include
the event AND and the event OR elements.
(1) Event AND: If the inputs of an event AND match state, it copies the state to the
output. Otherwise, the previous state is maintained. Thus ANDing of two events is possible
by using a Muller C-element. The logic symbol of a Muller C-element is shown in Fig. 2.5.
Figure 2.5. A Muller-C element
(2) Event OR: The OR function of events is relatively easy to implement. When
either input of an XOR gate changes state, its output changes state. Hence an XOR gate can
be used directly for event OR control.
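The behaviour of the two primitives can be captured in a few lines. The following is a behavioural sketch in Python, not a gate-level design; the class and function names are illustrative.

```python
class MullerC:
    """Event AND: the output follows the inputs only when they agree."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a      # inputs match: copy their state to the output
        return self.out       # inputs differ: hold the previous state


def event_or(a, b):
    """Event OR: an XOR gate changes its output when either input changes."""
    return a ^ b


c = MullerC()
print(c.update(1, 0))  # only one input has had an event: output holds at 0
print(c.update(1, 1))  # both inputs have had an event: output switches to 1
print(event_or(1, 0))  # an event on either input toggles the XOR output
```

The C-element produces an output event only after events have arrived on both inputs, while the XOR produces an output event for each input event, provided the two inputs do not change at the same time.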
2.4 Micropipelines
Pipelining is a common paradigm for high speed computation. Pipelining is similar
to an assembly line where parts of the assembly are handled by various sections. For
example, workers on an automobile assembly line perform small tasks, such as installing seat
covers and fixing mirrors. The power of the assembly line comes from the fact that many
workers perform small tasks to collectively produce many cars per day. Note that pipelining does
not reduce the time to produce one car, but it increases the number of cars being built
simultaneously and thus the rate at which cars are produced.
As applied to the ubiquitous microprocessor, pipelining is an implementation tech-
nique in which multiple instructions are overlapped in execution. For example, let an add
instruction be encountered in an instruction mix along with several other instructions. All
instructions encountered need to be fetched, decoded, executed and the results thus obtained,
written back. By pipelining, it is possible to fetch the next instruction while the previous one
is being decoded and so on. This increases the throughput of the system. In order to execute
the add instruction that was encountered, an adder would invariably be used. If the adder
itself could take data from only one instruction and work on it, two add instructions encoun-
tered one after another in the instruction mix would cause bubbles in the pipeline. The
bubble results from the fact that access to the adder is delayed until it has completed the data
manipulation on the previous instruction. A common solution to this problem is to design
the adder so that it can have data from many instructions resident in it at any given time, and
is capable of working on many such pieces of data simultaneously. This is pipelining at a
level lower than the instruction level and is hence called micropipelining.
If the various stages of a pipeline are forced to execute in lockstep with a global clock,
the result would be a synchronous pipeline. On the other hand, if each stage produced its
own request and acknowledge signals, an asynchronous pipeline would result. Various
asynchronous pipelined implementations exist and have been described in the literature. We
recommend [1], [12] and [13] as examples. Sutherland in [1] describes a transition signal-
ling and [12] implements it using ECDL, whereas [13] employs the four-phase scheme. We
will examine the transition signalling implementation in detail as it is the strategy used for
the work presented here.
What may not be obvious is that all micropipelines should follow what is known as
a bundled data convention. The bundled data convention [1] requires that the control and
data lines be treated as a single bundle, and data should always be available before the control
signal. Fig. 2.6 depicts a micropipeline using the transition signalling framework.
Figure 2.6. A Micropipeline using transition signalling
Before we begin a formal explanation of the signal propagation, a word about the
double edge triggered flip-flop (DETFF) is in order. The double edge triggered flip-flop can
be considered to be two D flip-flops arranged such that data can be latched at both the rising
and falling transitions of the control signal. Efficient implementations of the DETFF have
been proposed and tested by Lu [14].
The operation of the micropipeline can be explained as follows. Each loop of the
control signals as can be seen from Fig. 2.6, has a single inverter and hence can oscillate.
If each of the Muller C-elements is at the same initial state, then a transition in the input Rin
will propagate all the way to the output as follows. A change in the sense of Rin will cause
the output of the Muller C-element to change. This transition is used by the DETFF to latch
the input data. The signal propagates through the DETFF and is used as the Ain signal for
the previous stage, indicating acceptance of the data. The LOGIC completes the computation.
On completion of this computation, a request signal is to be generated for the next
block. The DELAY element comes into play at this time. The DELAY is adjusted in such
a way that the Ain signal propagates through it and appears as a Request for the next stage
after the logic has completed its evaluation. This signal is fed back through the DETFF
belonging to the next stage to be synchronized with the next transition at the Rin line.
As can be seen, the DELAY needs to be long enough to account for the LOGIC block
delay. In other words, the DELAY element is present to ensure that the bundled data convention
is satisfied. Traditionally this DELAY element has been a problem. It is extremely difficult
to predict the delay time as it depends on the depth of the logic, thus complicating synthesis.
Furthermore, it is very difficult to predict the behaviour of the DELAY element under
process variations. An easy solution to this problem is to eliminate the DELAY element
from circuits altogether. To this end, differential logic families offer a solution. If the
DELAY element is still used, it can be made a tunable DELAY. Variations in process
parameters can then be accounted for by tuning the DELAY value, for example by
changing its operating point through a bias voltage. The work presented here makes use
of such a tunable DELAY element.
2.5 Completion Signal Generation
One of the best solutions to the meta-stability problem in asynchronous circuits is to
derive completion signals from the data itself. This can be done if differential logic
families are employed for the purpose of computation. We will discuss the process of
completion signal generation using DCVSL and ECDL below.
2.5.1 DCVSL (Differential Cascode Voltage Switch Logic)
The operation of DCVSL is illustrated in Fig. 2.7. When the signal I (initialize) is
low, the NMOS tree is cut off and the two output nodes get precharged. When I goes high,
one of the output nodes gets discharged conditionally, depending on the NMOS logic network.
The key feature is the double-rail coded nature of this logic. The outputs are initially
pulled down to a '0' during the precharge phase. After computation is complete, the outputs
settle on opposite rails depending on the logic function evaluated. Thus simply ORing the two
outputs would produce a reliable completion signal. A '1' at the output of the OR gate
indicates that the logic has completed its evaluation.
Figure 2.7. DCVSL Gate Structure
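The completion detection described above can be sketched in C. This is a behavioural model only, assuming the dual-rail outputs read (0,0) while precharged and (1,0) or (0,1) once evaluated; the name `dcvsl_complete` is hypothetical.

```c
#include <assert.h>

/* Completion signal of one DCVSL gate: the OR of the two output rails.
 * Returns 0 while the gate is precharged and 1 once it has evaluated. */
int dcvsl_complete(int out, int out_bar)
{
    /* Assumption: both rails high at once never occurs in a working gate. */
    assert(!(out && out_bar));
    return out | out_bar;
}
```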
When an algorithm is partitioned into a number of cells, each cell typically comprises
a number of such DCVSL cells connected in series. Only at the end of the series need there
be a connection between the DCVSL gate and the handshaking circuit. Typically the Rin
signal is connected to the input I at the interface. Thus the DELAY element depicted in Fig.
2.6 has been effectively substituted by a DCVSL gate, which simulates the exact delay.
However, this completion signal is a four-phase signal. Additional hardware is required to
convert this to two-phase and use it in a two-phase system. The usage of the OR gate provides
a hidden advantage. A finite delay time will elapse between the data becoming valid and the
OR gate switching to indicate completion. This, combined with the wiring delay, can account
for the setup time of the register to which this output is connected, in cases where this
signal is used to drive the clock signal of a flip-flop.
2.5.2 ECDL (Enable/Disable CMOS Differential Logic)
ECDL circuits are characterized by two distinct states, enabled and disabled, the
states being determined by the transition of the control signal I. Figure 2.8 depicts the
general structure of an ECDL gate. The signal I represents the completion signal issued by the
preceding stage, and Done is the control signal that would be issued by the current stage after
all logic processing connected with it is completed.
During the disabled state, that is, when I is at logic '1', out and its complement
$\overline{\text{out}}$ are reset to '0'. When I is pulled down to '0' by the previous stage
after it has finished its processing, this stage is enabled. Depending on the N network, the
outputs get set. Since the outputs are complements of each other, there will always be an
imbalance between the outputs after evaluation is complete. This condition can be utilized
to feed a NOR gate which produces the actual completion signal Done. Multiple output
functions may be generated using multiple ECDL gates.
Methods to produce AND, OR and invert functions of events using ECDL gates are
discussed in [12]. This paper also exemplifies micropipelines using two-phase signalling
schemes but a four-phase logic family.
In differential gates such as DCVSL and ECDL, the complexity of the NMOS logic
tree is usually less than twice the complexity of its single-ended counterpart. This is a critical
factor, as the overhead incurred in using differential logic may not be justified if the NMOS
tree is not very complex, i.e., the logic to be implemented is not deep. This is a tradeoff where
gate count is traded off to get reliable completion signals.
Figure 2.8. Structure of an ECDL gate
CHAPTER 3. ARCHITECTURE OF THE PARALLEL HALFADDER
The addition of two operands is the most frequent operation in any arithmetic unit.
A two-operand adder is not only used for addition and subtraction, but also for multiplica-
tion and division. Consequently, an efficient two-operand adder is essential. Adders differ
in the way they handle carries, and for the most part, the method employed for carry propaga-
tion is what distinguishes their performance. The two-operand adder in its simplest form,
ripples a carry throughout the span of its bits to perform addition. This implementation uses
the least amount of hardware. Various other methods exist, which employ ingenious ways
to reduce the carry propagation delay and speed up the addition process. However, this in-
creased performance comes with an increased hardware cost.
To date, the fastest known adder implementation uses a carry lookahead
scheme. Unfortunately, it is also the most hardware consuming. Hence the best adder would
be one that approaches a Carry Ripple Adder in hardware complexity and a Carry Lookahead
Adder in speed. This chapter of the report describes the key features of a
Parallel Halfadder, which we have implemented. It also describes the features that the PHA
borrows from some existing addition schemes to make it approach the speed of a Carry
Lookahead Adder and possess the hardware complexity of a Carry Ripple Adder.
3.1 Addition Schemes
A vast volume of literature can be found on the subject of addition. This section of
the report examines the schemes of addition which serve as good limits on the efficiency
spectrum. The addition schemes discussed in this section include Carry Ripple Addition (CRA),
Conditional Sum Addition (CSA), Carry-Completion Sensing Addition (CCA), and
Carry Look Ahead Addition (CLA). The interested reader is referred to [15] and [16] for more
addition schemes, like Carry Skip Addition, and a detailed explanation of what is provided
here. Parallel Halfadder addition, which uses features from CLA, CRA and CCA, will be
discussed in Section 3.2.
3.1.1 Ripple Carry Addition
To add two n-bit numbers, the straightforward implementation is to have n full adders.
A full adder accepts as inputs two operand bits, say Ai and Bi, and an incoming carry
bit Ci, and produces as outputs a sum Si and an outgoing carry Ci+1. This outgoing carry from
stage i is the incoming carry for stage i+1. The corresponding logic equations implemented
by the full adder are:
$$S_i = A_i \oplus B_i \oplus C_i \qquad (3.1)$$
$$C_{i+1} = A_i B_i + C_i (A_i + B_i) \qquad (3.2)$$
Figure 3.1 depicts a CRA.
Figure 3.1. Carry Ripple Adder
The carry-in C0 needs to ripple from the least significant bit (LSB) position to the most
significant bit (MSB) position. Thus the name ripple-carry adder. If a term $\Delta_{FA}$ is defined
as the operation time or delay of a full adder, then to produce an n-bit result the time taken
by a CRA would be $n\Delta_{FA}$. The term $\Delta_{FA}$ comprises a two-level delay: the time to
compute the terms AiBi and Ai + Bi in parallel, and one gate delay to calculate Ci(Ai + Bi).
If one gate delay is defined by the term $\Delta_g$, then the total delay of an n-bit CRA would be $2n\Delta_g$.
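A bit-serial software model makes the rippling explicit. The sketch below applies Eqns. (3.1) and (3.2) stage by stage; the function name `ripple_add` and the convention of returning the carry-out in bit n of the result are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* n-bit carry ripple addition built from Eqns. (3.1) and (3.2).
 * Bits 0..n-1 of the result hold the sum; bit n holds the carry-out C_n. */
uint64_t ripple_add(uint32_t a, uint32_t b, unsigned c0, unsigned n)
{
    assert(n >= 1 && n <= 32 && c0 <= 1);
    uint64_t sum = 0;
    unsigned c = c0;                              /* carry into stage 0 */
    for (unsigned i = 0; i < n; i++) {
        unsigned ai = (a >> i) & 1u;
        unsigned bi = (b >> i) & 1u;
        sum |= (uint64_t)(ai ^ bi ^ c) << i;      /* Eqn. (3.1) */
        c = (ai & bi) | (c & (ai | bi));          /* Eqn. (3.2) */
    }
    return sum | ((uint64_t)c << n);
}
```

Each loop iteration depends on the carry of the previous one, which is the software analogue of the $n\Delta_{FA}$ delay.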
3.1.2 Conditional Sum Adder
Another scheme for fast addition can be attributed to Sklansky [17]. Instead of
waiting for the carry to ripple through the n bits of the adder, the CSA splits the bits into
groups. Each group produces two sums and carry outs, one assuming an incoming carry of
1 and another assuming an incoming carry of 0. Obviously, by just having all the n bits
lumped as a single group results in a full carry propagation time. Hence the most logical way
to divide the adder would be to divide the n bits into two groups of n/2 bits each and then
again divide the n/2 bit group into n/4 bit groups and so on. This process can continue until
the number of bits in the last group is one, provided n is a power of 2. Since repeated division
by two is effected, the number of steps required to implement this process is log2n. The final
sum can now be obtained by just selecting from a number of sum results available based on
the carry-in at the lowest level and the intermediate carry-outs produced at each level. The
gate count of such a conditional sum adder has been shown to be $3n[2 + \log_2(n + 1)]$ by
Sklansky in [18]. The total delay time to compute the sum of n bits has also been shown to
be $[2 + 2\log_2(n + 1)]\Delta_g$. The extra hardware cost depends on the number of groups and
the multiplexing cost to select one of the two results based on the carry-out in each step of
the addition process.
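The recursive halving can be illustrated with a short C sketch. It is a software model under stated assumptions, not Sklansky's circuit: the width must be a power of two no larger than 16, and the type and function names (`cs_result`, `cs_pair`, `cond_sum`, `cond_sum_add`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sum and carry-out of one group for a given carry-in. */
typedef struct { uint32_t sum; unsigned cout; } cs_result;
/* Both conditional results of a group: c0 assumes carry-in 0, c1 assumes 1. */
typedef struct { cs_result c0, c1; } cs_pair;

cs_pair cond_sum(uint32_t a, uint32_t b, unsigned width)
{
    cs_pair r;
    assert(width >= 1 && width <= 16 && (width & (width - 1)) == 0);
    if (width == 1) {                      /* one-bit group: both cases directly */
        unsigned x = a & 1u, y = b & 1u;
        r.c0.sum = x ^ y;         r.c0.cout = x & y;
        r.c1.sum = (x ^ y) ^ 1u;  r.c1.cout = x | y;
        return r;
    }
    unsigned half = width / 2;
    uint32_t mask = (1u << half) - 1u;
    cs_pair lo = cond_sum(a & mask, b & mask, half);
    cs_pair hi = cond_sum((a >> half) & mask, (b >> half) & mask, half);
    /* The low half's carry-out selects which precomputed high half to use. */
    cs_result h0 = lo.c0.cout ? hi.c1 : hi.c0;
    cs_result h1 = lo.c1.cout ? hi.c1 : hi.c0;
    r.c0.sum = lo.c0.sum | (h0.sum << half);  r.c0.cout = h0.cout;
    r.c1.sum = lo.c1.sum | (h1.sum << half);  r.c1.cout = h1.cout;
    return r;
}

/* Final selection by the actual carry-in; bit `width` holds the carry-out. */
uint32_t cond_sum_add(uint32_t a, uint32_t b, unsigned cin, unsigned width)
{
    cs_pair r = cond_sum(a, b, width);
    cs_result pick = cin ? r.c1 : r.c0;
    return pick.sum | ((uint32_t)pick.cout << width);
}
```

The two selections per level play the role of the multiplexers mentioned above, and the recursion depth is log2 of the width.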
3.1.3 Carry-Completion Sensing Adders
CCAs [16] use n full adders to sum two n-bit numbers, just as a normal CRA does [16].
The difference is that the carry-out of the (i-1)th stage is not fed as a carry-in to the i th stage.
Instead, the carry-in for each stage is independently generated in the form of a carry vector.
Looking at the input vectors for addition in any adder, it is possible to predict the
carry out from some stages. If both inputs for any stage are 1s, then irrespective of the carry
in the carry-out will always be a one. Such a combination is called a generate term. Similar-
ly any stage which has both inputs as zeros can only produce a 0 for carry-out irrespective
of the carry-in. Such a bit pattern is called a carry sink. Stages which have one input bit '0'
and the other '1' will always pass the carry-in to the carry-out. Such stages are called propa-
gate stages. It can be seen that these propagate stages introduce a delay in carry rippling.
The carry-outs for other stages can be generated just from the input patterns.
Once the carry propagation is complete, the sum can be computed in $2\Delta_g$ time units,
which is the time it takes for a full adder to calculate its output when all the input bits are
available.
Employing the carry completion scheme speeds up the addition process as follows.
The carry generate process for all the stages takes place in parallel, and some carry bits are
set this way. Carry propagation must be done between any two stages where the carry has
been generated. The length of the carry propagating segments depends on the input bit pat-
terns. This yields partial parallelism. Hence the total carry propagate time for all segments
will be the time it takes for the longest segment to propagate a carry. Burks, Goldstine and
von Neumann [19] have shown that the upper bound of the average longest carry length of two
n-bit binary numbers is given by $\log_2 n$. The ratio of the worst case to the average case
increases as n becomes large. Consequently, the CCA becomes a better alternative as the number
of bits to be processed increases.
Such an adder is characterized by a gate count of $17n - 1$ gates. Sklansky [18] predicts
that the lower bound on the delay of such an adder is given by $(n + 4)\Delta_g$.
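The dependence of the carry time on the input pattern can be made concrete with a small C sketch that classifies each stage as generate, propagate or sink and measures the longest run of propagate stages that actually carries a '1'. The function name `longest_carry_run` and the assumption of a zero carry-in are choices made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Length of the longest chain of propagate stages through which a
 * generated carry travels, for n-bit operands and a carry-in of 0. */
unsigned longest_carry_run(uint32_t a, uint32_t b, unsigned n)
{
    assert(n >= 1 && n <= 32);
    unsigned best = 0, run = 0;
    int carrying = 0;                    /* is a carry entering stage i? */
    for (unsigned i = 0; i < n; i++) {
        unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        if (ai & bi) {                   /* generate: a new carry starts here */
            carrying = 1;
            run = 0;
        } else if (ai ^ bi) {            /* propagate: pass the carry along */
            if (carrying && ++run > best)
                best = run;
        } else {                         /* sink: the carry dies here */
            carrying = 0;
            run = 0;
        }
    }
    return best;
}
```

For example, operands with no generate stage at all return 0, while 0xFFFF + 0x0001 forces the carry through 15 propagate stages.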
3.1.4 Carry Lookahead Adders
CLAs, carry the idea of carry generation and propagation one step further than
CCAs[15]. The CCA still ripples the carry within its groups where carry needs to be propa-
gated. The CLA does it all in parallel. We define the Generate and Propagate terms that were
introduced in the previous section formally here.
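$$G_i = A_i B_i \qquad (3.3)$$
$$P_i = A_i \oplus B_i \qquad (3.4)$$
Hence at any stage with a carry-in of Ci the carry-out can be determined to be
$$C_{i+1} = A_i B_i + C_i (A_i + B_i) \qquad (3.5)$$
or
$$C_{i+1} = G_i + C_i P_i \qquad (3.6)$$
By the same token, we have that
$$C_i = G_{i-1} + C_{i-1} P_{i-1} \qquad (3.7)$$
Substituting (3.7) in (3.6), we obtain
$$C_{i+1} = G_i + G_{i-1} P_i + C_{i-1} P_{i-1} P_i \qquad (3.8)$$
Further substitution yields
$$C_{i+1} = G_i + G_{i-1} P_i + G_{i-2} P_{i-1} P_i + \cdots + C_0 P_0 P_1 \cdots P_i \qquad (3.9)$$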
From equation (3.9) it can be seen that all the carries can be calculated in parallel, and hence
there is no need for a carry to ripple through all the bits.
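The following C sketch evaluates Eqn. (3.9) literally: every carry is formed from the G and P terms and C0 alone, never from a previously computed carry. The loops are of course sequential in software, so this only illustrates the data dependence, not the speed. The name `cla_add` and the packing of the carry-out into bit n are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* n-bit addition with every carry taken from the expansion of Eqn. (3.9).
 * Bits 0..n-1 of the result hold the sum; bit n holds the carry-out. */
uint64_t cla_add(uint32_t a, uint32_t b, unsigned c0, unsigned n)
{
    assert(n >= 1 && n <= 32 && c0 <= 1);
    uint32_t g = a & b;                  /* generate terms, Eqn. (3.3) */
    uint32_t p = a ^ b;                  /* propagate terms, Eqn. (3.4) */
    uint64_t carries = c0;               /* bit i holds C_i */
    for (unsigned i = 0; i < n; i++) {
        unsigned c = 0;
        unsigned prod = 1;               /* running product P_i P_(i-1) ... */
        for (int j = (int)i; j >= 0; j--) {
            c |= ((g >> j) & 1u) & prod; /* term G_j P_(j+1) ... P_i */
            prod &= (p >> j) & 1u;
        }
        c |= c0 & prod;                  /* term C_0 P_0 P_1 ... P_i */
        carries |= (uint64_t)c << (i + 1);
    }
    uint64_t mask = (1ull << n) - 1;
    uint64_t sum = ((uint64_t)p ^ carries) & mask;   /* S_i = P_i xor C_i */
    return sum | (carries & (1ull << n));
}
```

The inner loop shows why fan-in grows with the bit position: the term count for C_{i+1} is i + 2.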
One of the main problems in using a CLA which implements Eqn. (3.9) is fan-in. If
the number of bits becomes large, then a number of high fan-in gates are required, and this
is a definite problem in VLSI implementation. One method to circumvent this problem is
to make use of Block CLA adders (BCLA). Figure 3.2 shows a BCLA for a 16-bit 2 level
carry lookahead implementation.
Figure 3.2. A 2 level 16-bit CLA adder
G* and P* indicate group-generate and group-propagate terms. The group-generate and
group-propagate terms are used like carry-ins in single input cases. G** and P** indicate
the section-generate and section propagate terms. These terms can be used, if for example
a 64-bit adder is to be constructed using 4 similar adders as shown in Fig. 3.2.
The block CLA adder is popular among VLSI designers due to the fact that it is very
modular in construction and cascading to any number of bits is easily possible. Modularity
saves area in VLSI implementation as well. The speed of a carry lookahead adder is traded
off for fan-in in the case of a large number of input bits. Hence the speed and gate count
of a CLA adder are directly proportional to the fan-in allowed at the gate level. A
straightforward calculation of the gate count and computation time is difficult for a CLA. Sklansky
[18] has come up with the following figures for gate count and computation time. All the
calculations are based on the assumption that only 2 input gates are used.
p2 Gate Count = 6n + 1 + + 2p 1 P4
2
] P _-4- lq+ k(n+
1 [ 1)(p1)
where
(3.10)
$$k = \log_p[1 + n(p - 1)] - 1 \qquad (3.11)$$
andq = 1 + (n1)pn (3.12)
The computation time is given by,
$$[4 + k(p + 1)]\Delta_g \qquad (3.13)$$
Having discussed a few addition schemes, we will now describe the Parallel Halfad-
der(PHA), and list the features that it borrows from these adders to make it approach the
hardware complexity of a CRA and the speed of a CLA.
3.2 The Parallel Halfadder
The original idea of Parallel Halfadder addition was conceived by David B.
Swink[20]. We have developed an architecture, evaluated the hardware issues and have
characterized the performance of such an adder here. The practical use of such a PHA has
been exemplified by using it in a self-timed divider.
On considering the i th bit of any adder, it can be seen that four variables are of importance.
These are the two operands Ai and Bi, and the carry-in and carry-out, Ci and Ci+1
respectively. Once Ai, Bi and Ci are available, the value of the sum and Ci+1 can be calculated.
Let two sequences of numbers as shown in Fig. 3.3 be considered. The first bit set needs no
carry propagation, and the result is available just by assuming all Ci terms are 0 and applying
equation (3.1). Using (3.2) the carry-out from the adder can be determined. In the second
case, carry propagation must be done to some midway point, the second least significant bit
in this case. Tremendous amounts of time are wasted by modelling the worst case carry
propagation in both cases, i.e., by waiting, assuming that a carry will propagate from the least
to the most significant bit. The CRA waits for such carry propagation. The CLA, on the other
hand, uses hardware to predict the carries in each stage and is consequently faster.
As shown in Fig. 3.3, both the CLA and CRA are inefficient. There is neither a necessity
to predict the carries nor to wait for carry propagation in the first example, and both these
need to be done only for a few bits in the second example. Hence a method which determines
the point till which carry propagation is needed is intuitively better. The CCA makes use
of this philosophy to some extent. It also uses the idea of generate and propagate terms to
determine the carry vector. Since it works on producing a carry vector, it takes a finite
amount of time to determine the carry vector even if no carry propagation is needed.
                                  1
    1010                        1011
  + 0101                      + 0010
    ----                        ----
    1111                        1101
a. No carry propagation     b. Carry dies after second bit
Figure 3.3. Carry Propagation for two 4-bit numbers
An examination of the propagate term of Eqn. (3.4) suggests that it is, in itself, a sum
without a carry-in taken into account. So effectively, in generating a propagate term, the
summing portion of a half adder has been used. Two half adders can make a full adder, and
the full adder can produce a sum after taking the carry-in into account. Hence, in principle
it is possible to determine the sum of two n-bit numbers by initially XORing the input bits and
then iteratively adjusting the final result until all the carry propagation has been accounted
for. Besides, the carry vector generated for adjustment can be continuously monitored and
the point where no more carry propagation is needed can be easily determined. The parallel
half adder uses the ideas mentioned above.
By using the principle of iteration in time, the most obvious advantage is the hardware
saving of one half adder per bit of input data. The inquisitive reader might object
that iteration necessitates intermediate storage, and the hardware cost of the two n-bit registers
needed for iteration would exceed the hardware cost of n half adders. This is true. But the
hidden advantage of using registers is in pipelining of the adder.
The PHA accomplishes iterative convergence by using two n-bit registers, one called
the SUM register and the other the CARRY register. The register naming is appropriate,
as the sum converges to its final value in the SUM register and the carry converges to
'0' in the CARRY register. The steps that are executed to accomplish such a convergence
are as follows:
1. The addend and the augend are loaded into the two registers, SUM and CARRY.
2. The SUM and CARRY registers are simultaneously bit-wise ANDed and XORed.
3. The XORed result is routed back to the SUM register and the ANDed result is left shifted
once, padded with an LSB zero and routed back to the CARRY register.
4. The CARRY register is checked. If it is zero, the process is complete and the SUM is
available in the SUM register. Otherwise, steps 2 and 3 are repeated until the CARRY register
zeroes.
Figure 3.4 lists a C program that simulates this algorithm.
The zero detection signal serves as the signal to flag the operations in the adder. A
zero value at the output of the zero detector signals no operation, and a non-zero value
indicates computation. This makes the scheme self-timed, i.e., the convergence of the adder is
dependent on the input bit patterns. Such an adder is very difficult to model synchronously.
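/*
 * Parallel Half Adder
 * Adapted from David B. Swink [20]
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    /* Get the command line arguments for the addend and the augend */
    if (argc < 4) {
        fprintf(stderr, "usage: %s addend mode augend\n", argv[0]);
        return 1;
    }
    unsigned int sum = (unsigned int)atoi(argv[1]);    /* First operand */
    char mode = argv[2][0];                            /* Add? Subtract */
    unsigned int carry = (unsigned int)atoi(argv[3]);  /* Second operand */
    unsigned int sum_1 = sum;
    unsigned int carry_2 = carry;
    unsigned int temp;                                 /* Temporary operand save */
    int i = 0;                                         /* Counter */

    /* mode and the saved operands are not used by the addition loop below */
    (void)mode; (void)sum_1; (void)carry_2;

    printf("Step     SUM     CARRY    RESULT\n\n");
    while (carry != 0)
    {
        temp = sum ^ carry;              /* half-adder sums */
        carry = (sum & carry) << 1;      /* half-adder carries, shifted left */
        sum = temp;
        printf("%2d :%7X + %9X = %9X \n",
               ++i, sum, carry, sum + carry);
    }
    return 0;
}
Figure 3.4. C program that simulates the PHA [20]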
The operation of the PHA can be understood as follows. The bit wise XORing of the
operands produces a sum without taking into account the carry-in at any stage. When the
carry-in to the adder is zero, the only carries that need to be accounted for are the ones that
are internally generated. Such internally generated carries per iteration are determined by32
the bit wise ANDing of the operands. This internally generated carry is repeatedly adjusted
until all the carry propagation is completed. The bit wise XORing of the SUM and left shifted
CARRY register adjusts this. The end of adjustment is signified by the zeroing ofthe
CARRY register.
Figures 3.5 and 3.6 show the computation results obtained by compiling and executing
the code of Fig. 3.4. Figure 3.5 illustrates a fast case in which no carry propagation is
needed and the result converges early. The second example is one in which the carry needs
to propagate for a long distance and hence the number of iterations required for convergence
is large.
On average, the carry propagation for n-bit operands has been proved to be equal
to $\log_2 n$ [19]. We have run software simulations and determined the average number of
iterations required to attain convergence for 16 bits to be equal to 3.89. Hence, if the delays
encountered per iteration are small, theoretically it is possible to achieve an average
convergence time comparable to a carry lookahead adder. The following section presents the
architecture of the PHA. Based on the architecture, it will be possible to calculate the per
iteration delay and the per bit hardware cost associated.
Step     SUM      CARRY     RESULT
 0 :    3042 +     8704 =     B746
 1 :    B746 +        0 =     B746
Figure 3.5. Software Simulation of the PHA: Fast Convergence
Step     SUM      CARRY     RESULT
 0 :      2E +     7FD3 =     8001
 1 :    7FFD +        4 =     8001
 2 :    7FF9 +        8 =     8001
 3 :    7FF1 +       10 =     8001
 4 :    7FE1 +       20 =     8001
 5 :    7FC1 +       40 =     8001
 6 :    7F81 +       80 =     8001
 7 :    7F01 +      100 =     8001
 8 :    7E01 +      200 =     8001
 9 :    7C01 +      400 =     8001
10 :    7801 +      800 =     8001
11 :    7001 +     1000 =     8001
12 :    6001 +     2000 =     8001
13 :    4001 +     4000 =     8001
14 :       1 +     8000 =     8001
15 :    8001 +        0 =     8001
Figure 3.6. Software simulation of the PHA: Worst Case convergence
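The iteration counts in the two traces can be reproduced with a compact C sketch, which can also estimate an average over pseudo-random 16-bit operands. The function names, the linear congruential generator and its seed are assumptions made here; since the counting convention of the original simulation is not stated, the estimate need not coincide with the figure of 3.89 quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Sum produced by the PHA iteration (XOR for the sum, shifted AND for the carry). */
uint32_t pha_add(uint32_t sum, uint32_t carry)
{
    while (carry != 0) {
        uint32_t temp = sum ^ carry;
        carry = (sum & carry) << 1;
        sum = temp;
    }
    return sum;
}

/* Number of passes through the loop before the CARRY register zeroes. */
unsigned pha_iterations(uint32_t sum, uint32_t carry)
{
    unsigned count = 0;
    while (carry != 0) {
        uint32_t temp = sum ^ carry;
        carry = (sum & carry) << 1;
        sum = temp;
        count++;
    }
    return count;
}

/* Average iteration count over `trials` pseudo-random 16-bit operand pairs. */
double pha_average_iterations(unsigned trials)
{
    assert(trials > 0);
    uint32_t state = 12345u;             /* fixed seed so runs are repeatable */
    unsigned long total = 0;
    for (unsigned t = 0; t < trials; t++) {
        state = state * 1664525u + 1013904223u;
        uint32_t a = (state >> 16) & 0xFFFFu;
        state = state * 1664525u + 1013904223u;
        uint32_t b = (state >> 16) & 0xFFFFu;
        total += pha_iterations(a, b);
    }
    return (double)total / trials;
}
```

With the operands of Fig. 3.5 the loop runs once, and with those of Fig. 3.6 it runs fifteen times, matching the step numbers in the traces.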
3.3 The Proposed Architecture
The asynchronous PHA design, when looked at from a high level, needs 5 control
signals in addition to the data lines. These are Rin1, the request signal that comes in bundled
with the data lines containing the addend; Rin2, the control signal bundled with the augend
data; Aout, the acknowledge issued by the succeeding stage indicating that the previous result
produced by the adder has been consumed; Rout, the signal produced by the adder to the next
stage indicating that the sum bits are available; and Ain, the signal produced by the adder
indicating that the input operands have been consumed.
Figure 3.7 depicts the architecture of the adder. The two registers, SUM and
CARRY, are n+1 bits wide for two n-bit operands. The (n+1)th bit is used for producing the
final carry-out from the adder. The SYNC block is used to detect if all inputs are available
and the previous output produced by the adder has been consumed by the adder's succeeding
stage. The zero detector is used to determine the state of the CARRY register. The same
signal can be used to select whether new data needs to be latched or the intermediate iteration
results are to be recirculated. The CONTROL block generates the internal clocking and control
signals for the SUM and CARRY registers using the zero detection signal. The COMP
block, which comprises a plane of XOR and AND gates, does the actual computation. The
result of the addition is available from the SUM register after the CARRY register zeroes. The
(n+1)th bit of the SUM register holds the carry of the addition operation. Complete circuits
to implement the adder are provided in Sec. 5.4.
Figure 3.7. The architecture of a PHA
With the block level description as provided above, it is possible to calculate the hardware
cost required to implement the adder. An approximate figure for the per iteration time
delay can also be estimated.
3.4 Including Subtraction
The PHA implements 2's complement addition. Hence, to subtract one operand from
the other, the subtrahend needs to be complemented, and a '1' added to the LSB. To allow
this enhancement, a new one-bit register, namely the MODE register, needs to be added. The
peculiar problem surrounding subtraction here is the fact that the PHA does not support a
carry-in along with the two operands. If such a carry-in were present, the problem of
subtraction would be reduced to complementing one of the operands, forcing a carry-in and
performing a normal addition. Since this is not possible in the case of the PHA, it is necessary
to introduce the carry-in in the second iteration. This is done by introducing a '1' during the
carry shifting in the second cycle. This poses two problems. The first is the introduction
of the carry exactly in the second cycle, this being an asynchronous system. We have solved
the problem by using a state machine which produces the selection signals for a MUX, which
selects either a '0' or a '1' for shifting in depending on the MODE register and the output
of the said state machine. The second problem is the more crucial of the two. Since the carry
introduction is delayed, in the case of some bit patterns it is possible that the CARRY register
zeroes in the first iteration. The zero detector would then detect the process as complete and
the SUM register would actually not have converged to the right result. We have solved this
problem by modifying the zero detector and the control logic. These are shown in the chapter
on implementation in Figs. 5.2 and 5.5. The rest of the process remains the same. This
completes the operation of subtraction.
However, the final carry-out produced by the (n+1)th bit of the adder must be inverted
to determine the carry-out to the next stage. This can be done by selectively inverting
the carry-out depending on the value contained in the MODE register. Translated into
hardware, this warrants the inclusion of an XOR gate controlled by the MODE register and the
(n+1)th bit of the sum output.
Inclusion of subtraction is particularly relevant here, since division of two numbers
is performed by repeated subtractions. Figure 3.8 shows the software simulation of the adder
with this subtraction capability included. We chose this example as it demonstrates the prob-
lem of zeroing before actual convergence as discussed above.
Step        SUM       CARRY      RESULT
 0 :       5BA0 +      FFFC =     15B9C
 1 :   FFFF5BA2 +         2 =  FFFF5BA4
 2 :   FFFF5BA0 +         4 =  FFFF5BA4
 3 :   FFFF5BA4 +         0 =  FFFF5BA4
Figure 3.8. Subtraction of 65532 from 23456
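The delayed carry-in can be modelled in software as follows. This is a behavioural sketch under stated assumptions, not the state machine or the modified zero detector of Chapter 5: the registers are taken to be 32 bits wide, the '1' is OR-ed into the shifted carry during the second pass, and the loop simply refuses to terminate before that pass has happened. The name `pha_subtract` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Computes a - b (modulo 2^32) using the PHA iteration. */
uint32_t pha_subtract(uint32_t a, uint32_t b)
{
    uint32_t sum = a;
    uint32_t carry = ~b;                 /* complemented subtrahend */
    unsigned iter = 0;
    /* A zero CARRY is ignored until the '1' has been shifted in. */
    while (carry != 0 || iter < 2) {
        assert(iter < 34);               /* 2 start-up passes + at most 32 carry steps */
        uint32_t temp = sum ^ carry;
        carry = (sum & carry) << 1;
        if (iter == 1)
            carry |= 1u;                 /* shift in '1' instead of '0' in the second cycle */
        sum = temp;
        iter++;
    }
    return sum;
}
```

For the operands of Fig. 3.8 the CARRY register is already zero after the first pass, so the sketch exercises exactly the early-zeroing case discussed above.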
3.5 Overflow detection
Since the PHA does not propagate carries horizontally, detecting an overflow becomes
a problem. The overflow detection mechanism usually uses an XOR gate connected
to the carry-in of the last stage and the carry-out produced from the last stage. Hence
the problem of overflow detection is reduced to a problem of determining the carry-in at the
last stage in the case of the PHA, since the carry-out is already available.
Since the final sum is available after the adder has converged, and the original input
bits are known, determining the carry-in is reduced to a simple combinational logic solution.
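One possible form of that combinational logic is sketched below in C. It recovers the carry-in of the most significant stage from Eqn. (3.1) and, for self-containment, recomputes the carry-out from Eqn. (3.2) rather than reading it from the (n+1)th register bit. The function name `pha_overflow` is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the n-bit 2's complement addition a + b = sum overflowed. */
int pha_overflow(uint32_t a, uint32_t b, uint32_t sum, unsigned n)
{
    assert(n >= 1 && n <= 32);
    unsigned msb = n - 1;
    unsigned am = (a >> msb) & 1u;
    unsigned bm = (b >> msb) & 1u;
    unsigned sm = (sum >> msb) & 1u;
    unsigned cin  = am ^ bm ^ sm;                   /* solve Eqn. (3.1) for the carry-in */
    unsigned cout = (am & bm) | (cin & (am | bm));  /* Eqn. (3.2) */
    return (int)(cin ^ cout);
}
```

For 8-bit operands, 0x7F + 0x01 = 0x80 is flagged as an overflow, while 0xFF + 0x01 = 0x00 is not.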
3.6 Hardware consumption estimation
In the architecture of Fig. 3.7, the only blocks that are black boxes are the SYNC and
the CONTROL blocks. The rest of the blocks are self-explanatory. Hence, 2n multiplexers,
2n + 2 registers, n + 1 XOR gates and n + 1 AND gates are essential. The zero detector is
usually a NOR circuit and, until a more efficient method is presented in this report, it can be
assumed to be composed of a tree of NOR and NAND gates to compose a high fan-in NOR
gate. The corresponding hardware investment would be n - 1 gates. The control logic in our
implementation uses 22 discrete gates and its construction is presented in Fig. 5.5. The gate
count for the CONTROL block remains a constant irrespective of fan-in. This gate count is only
for the adder and the gates needed to control subtraction have not been included. This is a
reasonable assumption as we are providing comparison data for adders and we have indicated
the methods to complete subtraction only to present a complete analysis of all cases.
Table 3.1 summarizes and totals the hardware cost associated with the implementa-
tion of the PHA.
The main advantage of using the adder in a micropipeline is that the storage for every
pipestage is available for free. The SUM and CARRY registers provide the needed intermediate
storage. Since this adder is designed for operation in a micropipeline, where storage
always alternates with processing, the hardware cost associated with the registers can be ignored,
as the registers needed to implement the micropipeline can be eliminated.
For an n-bit CRA, the hardware cost associated is 7n gates [16]. From Table 3.1, it
can be seen that the number of gates necessary to implement a PHA is only marginally higher
than that of a CRA. This matches our objective of approaching the CRA in terms of hardware
complexity. The other half of the objective that still needs to be met is to rival the CLA
in terms of speed. The following section estimates the speed of the PHA.
3.7 Computational speed of the PHA
From Fig. 3.7, it can be seen that the following blocks form a part of the PHA's critical
path:
The Data Multiplexers
The Registers
The Zero Detector
The Computational Plane comprising XOR and AND gates
The Control unit generated delay
Figure 3.7 also shows that the computation plane and the zero detector work in parallel. Since
the zero detector is the slower of the two, the computational delay can be left out of the
calculation of the critical path. Assuming a fan-in of 6 for each gate in the zero detector, the
propagation time associated with an n-bit zero detector is $\log_6 n$ gate delays. Note that this
differs from the value used for the hardware complexity calculation, where the worst-case
hardware was assumed. As will be seen later in Chapter 5, the CONTROL unit uses a
multiplexer similar to the data multiplexer used in the datapath. It will also be seen that both
multiplexers are controlled by the same control signal. Hence, the design is datapath limited,
and it is enough to account for the logic delay of one multiplexer in the critical path
calculation.
| Description | No. of gates |
|---|---|
| Computation (XOR and AND gates) | 2n + 2 |
| Zero detect (worst case, assuming a binary tree implementation of the zero-detection NOR gate) | n − 1 |
| Multiplexers to select data | 4n |
| Control logic (one-time overhead) | 21 |
| Total number of gates for an n-bit adder | 7n + 22 |

Table 3.1: Hardware consumption estimation of an n-bit PHA
Assuming that the operand bits are uniformly distributed over their range, Burks et al. [19]
have pointed out that the average longest carry propagation chain is $\log_2 n$.
The critical path of the PHA includes the following delays:
Multiplexing: $2\Delta_g$
Latching using a double-edge-triggered flip-flop: 1 ns¹
Zero detection using 6-input NOR gates: $\log_6 n$ gate delays
Hence the total critical path is given by
$$\left((2 + \log_6 n)\,\Delta_g + 1\right)\log_2 n \tag{3.14}$$
The computation time of the CLA, as indicated in section 3.1.4, is of the order of
$\log_2 n$. Equation (3.14) suggests that the PHA is higher by a factor of $\log_6 n$. Hence a cursory
analysis would suggest that the PHA will progressively lag the CLA in speed as fan-in
increases. Since the hardware cost and computation time of the PHA are known at this time,
and similar values are available for all the adders discussed earlier in this chapter, a
comparison of these parameters can be made.
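The average-case figure of about log2 n iterations can be checked empirically. The sketch below assumes that one PHA iteration replaces SUM by SUM XOR CARRY and CARRY by (SUM AND CARRY) shifted left by one, repeating until CARRY is zero, as suggested by the XOR/AND plane, shifter and zero detector of the architecture. The function names, trial count and seed are illustrative choices, not part of the original work.

```python
import math
import random


def pha_passes(a: int, b: int) -> int:
    """Count XOR/AND passes until the carry word becomes zero."""
    s, c, passes = a, b, 0
    while c:
        s, c = s ^ c, (s & c) << 1   # half-add every bit position, shift the carries left
        passes += 1
    return passes


def average_passes(n_bits: int, trials: int = 2000, seed: int = 1) -> float:
    """Average number of passes over uniformly random n-bit operands."""
    rng = random.Random(seed)
    total = sum(pha_passes(rng.getrandbits(n_bits), rng.getrandbits(n_bits))
                for _ in range(trials))
    return total / trials


for n in (8, 16, 32, 64):
    print(n, round(average_passes(n), 2), "log2(n) =", math.log2(n))
```

The measured averages grow roughly logarithmically with the word length, whereas a worst-case operand pair such as 65535 + 1 needs n + 1 passes.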
Characterization of any adder should at the minimum, include the area investment,
its computation speed and its processing power. The other factors that need consideration
include the modularity or the ease with which the adder can be implemented in VLSI, its
power dissipation and the number of transitions that are necessary to converge to a result.
In chapter 4 we will present efficiency equations based on the minimum considerations
stated here. Using these equations, we will further try to compare the PHA with all the adders
presented in this chapter.
1 For a 1.2µ technology
CHAPTER 4. THE PHA: A PERFORMANCE EVALUATION
Having calculated the hardware complexity and the computation time of the PHA
in the previous chapter, we will introduce an efficiency model here to analyze its perfor-
mance. We will also compare the PHA to the other adders presented in the previous chapter.
All efficiency and hardware complexity estimation will be done assuming that the
adders are used in a pipeline. Here pipelining can mean one of two things. At the lowest
level, each bit of the adder may be a stage of a pipeline, and each such stage may produce
a completion signal after processing. Alternatively, all the bits of the adder are lumped
together and the adder is treated as a single pipestage with no pipelining internal to its
implementation. It is adders of this second type to which we direct our attention
in this study. The structure of an adder in a micropipeline is shown in Fig. 4.1.
4.1 An Efficiency model
The computational efficiency $\eta$ of an adder is directly dependent on two factors,
namely the investment I on the adder and its computing power n [18]. The computing power
determines the operating range of the adder. The investment I is a function of the total hardware
complexity y and the total addition time p.
$$\eta = \frac{n}{I} \tag{4.1}$$
Calculation of the investment on the adder constitutes an interesting problem. The
two terms in the function for investment, namely the total addition time p and the hardware
complexity y, differ from each other by several orders of magnitude. Typically, p is of the
order of a few nanoseconds and the gate count is of the order of several hundred. Hence
a function to calculate the investment on the adder needs to be determined.
Figure 4.1. An Adder in a pipeline
4.2 Investment on adders
Several different methods exist to calculate the investment term I in the efficiency
equation of the adder. The best function to calculate the investment is yet to be determined
and only approximate functions are available. We will try to indicate the various methods
of calculating I described in the literature, and calculate the efficiency of the PHA using all
the methods that can produce indicative results for efficiency calculations.
4.2.1 The Area time model
The area time model due to Sklansky [18] is perhaps the simplest function that can
be used to determine investment: it is the product of area and time. Unfortunately it is
also the most inaccurate measure. Since the gate count is several thousand times greater than
the addition time, the product of area and time places undue weight
on the gate count. Consequently the Ripple Carry Adder appears to perform best when
efficiencies are calculated using this model. This is misleading, since many VLSI designers are
now willing to trade area for speed. This necessitates the investigation of other models to
determine investment.
4.2.2 The log of Gate count model
Sklansky [18] has discussed the gate count and time delay of a number of popular
implementations of adders and we summarized the same in the preceding chapter. Assuming
only two input gates, it has been proposed that the area factor be described by log2 G, where
G is the raw gate count required to implement the adder. Thus, using Sklansky's model to
calculate the adder efficiency, equation (4.1) becomes:
$$\eta = \frac{n}{p \log_2 G} \tag{4.2}$$
Sklansky claims that the figures of efficiency obtained using this model seem to compare
favorably with the proclivity of VLSI designers to invest.
As the fan-in of any adder increases, so does its gate count. Since the gate count is
scaled logarithmically, this model is biased towards speed and attributes less
importance to gate count. Hence, we feel that this is not a precise estimate of investment.
4.2.3 The Area-time-squared Model
Another meaningful estimate of the investment on the adder can be obtained by the
product of the hardware complexity and the square of the computation time. The area-time-
squared bound is based on the information flow within a chip[21]. The application of this
bound modifies equation (4.1) as follows:
$$\eta = \frac{n}{p^2 G} \tag{4.3}$$
The area-time-squared bounds have been extensively used for the comparison of the
complexity of algorithms for VLSI implementations [21].
All the models presented above require the computation time and the hardware com-
plexity of adders to determine efficiency. The following section delves into the details of
calculations of these parameters.
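The three investment models can be compared side by side in a few lines. The sketch below assumes that efficiency is the computing power n divided by the investment, and that the investment is G·p, p·log2 G and G·p² for the three models respectively; these forms are a reading of the equations of this chapter rather than a quotation, and the function names are hypothetical. The sample inputs are the 16-bit CLA and PHA figures of Table 4.2, with the gate counts doubled to allow for wiring as described in section 4.3.

```python
import math


def efficiency_area_time(n: int, gates: float, delay: float) -> float:
    """Area-time model: investment taken as gates * delay."""
    return n / (gates * delay)


def efficiency_log_gates(n: int, gates: float, delay: float) -> float:
    """Log-of-gate-count model: investment taken as delay * log2(gates)."""
    return n / (delay * math.log2(gates))


def efficiency_area_time_squared(n: int, gates: float, delay: float) -> float:
    """Area-time-squared model: investment taken as gates * delay**2."""
    return n / (gates * delay ** 2)


# 16-bit entries of Table 4.2: (raw gate count, computation time).
adders = {"CLA": (236, 6.7811), "PHA": (134, 7.0948)}
for name, (raw_gates, delay) in adders.items():
    g = 2 * raw_gates  # adjusted gate count G, doubled to account for wiring area
    print(name,
          efficiency_area_time(16, g, delay),
          efficiency_log_gates(16, g, delay),
          efficiency_area_time_squared(16, g, delay))
```

Under these assumed forms the smaller gate count of the PHA outweighs its slightly longer delay at 16 bits, which matches the observation made later for low fan-in.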
4.3 Determination of Hardware complexity and computation time
The time that an adder takes to compute a result is the time difference between the
instant when both the data inputs are available and the instant when the calculated result is
available. For asynchronous implementations this represents the time lag between the
instant when both Rin1 and Rin2 are available and the instant when Rout is issued. For a
synchronous system this is measured as a fixed number of clock cycles.
However, there exists no hard and fast rule which can be used to determine the
hardware complexity of adders. The usual measure of this parameter has been the gate
count. Gate count is not an accurate indicator of VLSI real estate since it ignores wiring
areas. Besides, modularity of the design is also important in arriving at a dense
implementation with more gates per unit area. In fact, it is possible for an implementation
with more gates, but with simpler interconnections and more modularity, to occupy a smaller
area than one with fewer gates but more complex interconnections.
Assuming a design implemented using a standard cell library, the routing area would typically
be as much as the cell area itself. This is a reasonable assumption, as the standard cell library
philosophy is followed by most CAD tools for auto routing. Hence it is reasonable to
multiply the gate count in any design by 2 to arrive at the approximate hardware complexity. The
multiplied gate count shall be denoted by G.
Lu [9] has proposed a technique which takes into account the transistor count as an
indicator of the hardware complexity. An additional parameter called the wiring equivalence
number has been used as an indicator of the number of non-local signals that need to be
connected to a particular logic cell. All the calculations for CRA, CLA, CSA and CCA have
been done for ECDL.
However, our discussion will be limited to the gate level, and the gate count and
wiring area will be used as indicators of hardware complexity.
Table 4.1 indicates the functions used to calculate the gate count and efficiencies of
various adders. The pure gate count shown in Table 4.1 will be multiplied by 2 to get an
adjusted value that accounts for the wiring areas.
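| Type of adder | 2-input gate count | Computational delay time |
|---|---|---|
| Ripple Carry Adder | $7n$ | $2n\,\Delta_g$ |
| Carry Completion Adder | $17n - 1$ | $(n + 4)\,\Delta_g$ |
| Conditional Sum Adder | $3n[2 + \log_2(n + 1)]$ | $[2 + 2\log_2(n + 1)]\,\Delta_g$ |
| Parallel Half Adder | $7n + 22$ | $\log_2(n)\,(1 + \log_6(n)\,\Delta_g)$ |
| Carry Lookahead Adder | $6n + \ldots$ | $[4 + k(p + 1)]\,\Delta_g$ |

where $k = \log_p[1 + n(p - 1)] - 1$. The remaining terms of the Carry Lookahead Adder gate count, which involve p, q and k, and the definition of q are not legible in the source.

Table 4.1: Gate counts and speeds of different adders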
The variables in all the calculations represent the following:
n: The number of bits processed by the adder
$\Delta_g$: The propagation delay time of a single gate
All formulae assume two-input gates. This simplifying assumption has been made so that a
common platform for comparison may be established and the figure of efficiency does not
get distorted by unbalanced improvements due to different implementations. Figure 4.2
depicts these relations graphically.
Figure 4.2. Computation time and gate count of adders (plotted against the number of bits for the CSA, CLA, CCA, PHA and CRA)
4.4 Efficiency calculations
The confusion surrounding the accurate judgement of the investment on adders has
led us to use more than one model to calculate efficiency. None of the models presented in
this section is perfect; they serve only as general indicators of the efficiency trend.
Figure 4.3. Efficiency estimation of adders
Using the above values of gate counts after adjustment, Fig. 4.3 shows the curves for
the efficiency of various adders with inputs varying from 1 to 256 bits, using the area-time-squared
model and the log of gates model. The range from 1 to 128 bits is the most critical,
since we do not foresee a need for a larger fan-in in the near future. The rest of the
curve is provided only to indicate the general trend.
Using both the models the standings are the same. The CSA seems to be the most
efficient and the CRA the least efficient. The PHA lies in the center of the spectrum
in both cases. The factor limiting the efficiency of the PHA is that its speed is of the
order of $(\log_2 n)(\log_6 n)$, the second factor, $\log_6 n$, being contributed by the zero
detection network. However, the efficiency models are not perfect and have their drawbacks,
some of which are listed in the following section of the report.
4.5 Limitations of the efficiency model
One fact that stands out from Fig. 4.3 is that the CSA holds the top
position in the efficiency spectrum under both models of efficiency calculation. Yet it
is not the adder most widely used by VLSI designers. This highlights one of the
drawbacks of both the efficiency models used here: they are not sensitive to
layout issues like modularity.
The model for efficiency used here does not take into account special architectural
features like implementation using various logic families and using high efficiency and high
fan-in gates. Different logic families produce different types of improvements or degrada-
tion of performance. The PHA will gain considerable speed by using a logic family like
ECDL for its zero completions and the CLA will gain some speed if high fan-in gates are
used. These effects cannot be accounted for in our efficiency calculations.
Intrinsic gate delays have been used and the wiring delays have been ignored. This
is readily explained. Wiring delays become important only in cases
where a global signal needs to be routed to different parts of the system. This would typically
be the case only for control signals which run over long spans within the chip. Since this
discussion centers around the logic processing throughput, it is reasonable to ignore the
wiring delays, which usually are a small fraction of the logic delay. It is still an approximation.
4.6 Improving the Efficiency of the PHA
Table 4.2 shows the values of computation time and gate count for 8, 16, 32, 64, 128
and 256 bit adders. For lower values of fan-in the speeds of the CLA and PHA are
comparable, and the efficiency of the PHA is aided by a smaller gate count than that of the
CLA together with a comparable computation time. But as the fan-in increases, the efficiency seems
to depend solely on the computational speeds. The computational speed of the PHA is given
by:
$$\log_2 n\,\left(1 + (\log_6 n + 2)\,\Delta_g\right) \tag{4.4}$$
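| n | Computation time, CLA | Computation time, PHA | Gate count, CLA | Gate count, PHA |
|---|---|---|---|---|
| 8 | 5.5175 | 4.7408 | 98 | 78 |
| 16 | 6.7811 | 7.0948 | 236 | 134 |
| 32 | 8.0587 | 9.8357 | 559 | 246 |
| 64 | 9.3435 | 12.9634 | 1297 | 470 |
| 128 | 10.6319 | 16.4779 | 2960 | 918 |
| 256 | 11.9921 | 20.3793 | 6663 | 1814 |

Table 4.2: Gate count and computational speeds of the CLA and the PHA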
The first component of this delay, $\log_2 n$, is contributed by the number of iterations
that the adder needs to perform, on average, to attain convergence, and the term
$\log_6 n$ is contributed by the zero-detection delay. The latter is implementation sensitive: the
assumption underlying Eqn. (4.4) is that the zero detector uses gates of fan-in equal to 6.
Hence, methods to speed up the adder should take both these facts into consideration and
reduce either or both of them to implement a faster adder. The next section of the report is
dedicated to methods which can improve the speed of the adder and hence its efficiency.
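The tabulated figures can be reproduced numerically, which also makes the roles of the two logarithmic terms concrete. The sketch below rests on inferences from the numbers rather than on statements in the text: a unit gate delay of 0.5, a CLA group size p = 5, and the PHA delay in the Table 4.1 form log2(n)(1 + log6(n)·Δg), that is, without the additional multiplexer term that appears in Eqn. (4.4). The function names are hypothetical.

```python
import math

DELTA_G = 0.5  # assumed unit gate delay; inferred from the tabulated values


def pha_delay(n: int, dg: float = DELTA_G) -> float:
    """PHA delay in the Table 4.1 form: log2(n) * (1 + log6(n) * dg)."""
    return math.log2(n) * (1 + math.log(n, 6) * dg)


def cla_delay(n: int, p: int = 5, dg: float = DELTA_G) -> float:
    """CLA delay [4 + k(p + 1)] * dg with k = log_p[1 + n(p - 1)] - 1; p = 5 is inferred."""
    k = math.log(1 + n * (p - 1), p) - 1
    return (4 + k * (p + 1)) * dg


def pha_gates(n: int) -> int:
    """PHA gate count from Table 3.1."""
    return 7 * n + 22


for n in (8, 16, 32, 64, 128, 256):
    print(n, round(cla_delay(n), 4), round(pha_delay(n), 4), pha_gates(n))
```

With these assumptions the PHA columns and the CLA times for 8 to 128 bits agree with Table 4.2 to the printed precision; the 256-bit CLA time comes out at about 11.922 rather than the tabulated 11.9921, which may be a transcription slip in the table.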
4.6.1 The Tandem Adder
The convergence of the PHA depends on the bit patterns of the addend and the
augend. Fig. 4.4(a) shows severe carry propagation, where the carry is propagated throughout
the span of the bits. Fig. 4.4(b) shows the addition of the same numbers, but in their one's
complement form. The result is the same, but the advantage of using the technique of Fig.
4.4(b) is that the carry propagation is reduced. This technique can be applied to the PHA
to reduce long carry propagations. If the carry propagations are reduced, the average number
of iterations required for the sums to converge is reduced and hence the adder
becomes faster.
(a) 0101 + 0011 = 1000, with a carry (1 1 1) propagating across three bit positions
(b) The one's complements 1010 + 1100 give 0111, whose one's complement is 1000
Figure 4.4. Alternative addition of two numbers
By using two adders, one working on normal numbers and the other working on
complements, it is possible to attain fast convergence [20]. It should be noted that the worst case
carry propagation for one adder is the best case for the other. Hence one adder working on
complements can aid the normal adder, and the result available at the SUM register of the
adder that converges faster can be used. Thus, it is possible to reduce the number of iterations
necessary to ensure convergence.
As an example, the addition 65535 + 1 is a worst case for the normal
PHA. If the same addition were implemented using the tandem technique, convergence
would take place in two cycles, as shown in Fig. 4.5.
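The tandem idea can be tried out in software. The sketch below makes several assumptions that are not spelled out in this section: each adder iterates SUM XOR CARRY and the shifted SUM AND CARRY until the carry word is zero, the complemented adder works on one's complements in a word one bit wider than the operands, and a carry leaving the top bit re-enters at the bottom (an end-around carry modelled as a rotate). All function names are hypothetical.

```python
def pha_add(a: int, b: int, width: int):
    """Normal PHA: repeat SUM ^= CARRY, CARRY = (SUM & CARRY) << 1 until CARRY is zero."""
    mask = (1 << width) - 1
    s, c, passes = a & mask, b & mask, 0
    while c:
        s, c = s ^ c, ((s & c) << 1) & mask
        passes += 1
    return s, passes


def pha_add_complemented(a: int, b: int, width: int):
    """Same iteration on one's-complemented operands, with an assumed end-around carry."""
    mask = (1 << width) - 1
    s, c, passes = ~a & mask, ~b & mask, 0
    while c:
        g = s & c                                   # bit positions that generate a carry
        s ^= c
        c = ((g << 1) | (g >> (width - 1))) & mask  # rotate left by one position
        passes += 1
    return ~s & mask, passes                        # complement back to the true sum


def tandem_add(a: int, b: int, n_bits: int):
    """Run both adders and keep the result of whichever needs fewer passes."""
    width = n_bits + 1                              # extra bit holds the carry-out
    normal = pha_add(a, b, width)
    mirrored = pha_add_complemented(a, b, width)
    return min(normal, mirrored, key=lambda r: r[1])


print(pha_add(0xFFFF, 0x0001, 17))                # normal adder on the worst case
print(pha_add_complemented(0xFFFF, 0x0001, 17))   # complemented adder on the same operands
print(tandem_add(0xFFFF, 0x0001, 16))
```

Under these assumptions the normal adder needs 17 passes for 65535 + 1 while the complemented adder needs two, in line with the two cycles quoted above.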
As can be seen, the tandem adder will always perform better than a single adder.
However, this improved performance comes at a very large hardware cost. The hardware
required to implement this type of adder would be at least twice that of a simple PHA.
Besides, it would be necessary to implement complementing hardware and hardware to
detect which of the adders converged first and route the result accordingly. This would
typically require n multiplexers, n being the number of bits, and more complex control
logic to drive the select lines of these multiplexers. Hence this option is not favourable
from the hardware standpoint, though it remains to be seen whether the improved speed
justifies the increased hardware investment.
Step   SUM   CARRY   RESULT
0 :FFFF + 1 = 10000
1 :10000 + FFFFFNrr =10000
Figure 4.5. Tandem addition
4.6.2 Hiding the Zero-Detect delay
From Eqn. (4.4), it can be seen that two terms affect the speed of the adder: the first
is the number of iterations, $\log_2 n$, and the second is the zero-detect delay, denoted by the
term $\log_6 n$. The method indicated in the previous section can be used to reduce the total
number of iterations that the adder takes to converge. This section describes methods to improve
the zero-detect delay.
One way to reduce the zero-detect delay is to implement a high-speed zero detector.
But the speed-up that can be gained this way is limited, since a finite delay contribution will
always be present to detect a zero, and this finite contribution gets multiplied by the number
of iterations necessary to ensure convergence. Our implementation of the zero-detect circuit
uses a high fan-in grounded pull-up NOR gate. This implementation trades off power
consumption for speed. The advantage, however, is the nearly fan-in independent behaviour
of such a gate. This is so because the pull-down is effected by n N-transistors in parallel
and the pull-up by a single grounded-gate P-transistor, where n is the number of bits.
Since the primary interest in our zero detector is the transition from a logic '0' to a logic '1',
the P-transistor size has been optimized to take this fact into account.
A more efficient implementation is one that eliminates this delay altogether. Since
it is physically impossible for the system to work without a zero-detector, and all zero-detec-
tors have a finite delay, the only possible solution to this problem is to hide this delay as much
as possible. To do this, the algorithm for PHA addition presented in Chapter 3 needs to be
slightly modified.
The algorithm uses a "predict complete" strategy. It is assumed that the adder has
converged to the right result after each iteration has completed, and further processing is done
only after it has been ascertained otherwise. Considering our implementation and the
architecture of the PHA presented in Fig. 3.7, this delay is seen because the control unit issues a
select signal to the multiplexers only after the zero-detect circuit has resolved itself. The
prediction is correct only once per addition. Since the addition takes
$\log_2 n$ iterations to complete in the average case, an unnecessary wait on the
zero-detector resolution occurs $\log_2 n - 1$ times. Hence if a different strategy is adopted, namely a
"predict incomplete" strategy, the prediction would be correct $\log_2 n - 1$ times and wrong only
once, on the average.
To implement such a "predict incomplete" strategy, the extra hardware required
would be a new register, say SUM1, equal in size to the SUM register. Irrespective of the
zero-detector output, the control can issue a signal to the MUX to latch internal
data after the AND/XOR plane has calculated the new values. After the zero detector has
resolved, and should the assumption prove incorrect, the contents of the SUM register should
be rolled back one iteration. The SUM1 register makes this possible. The contents of the SUM1
register should always lag the contents of the SUM register by one clock edge so that such a
roll-back is possible. The constraints that are now imposed on the clocking are that the delay
between clock edges should be greater than the sum of the XOR/AND plane delay
and the MUX delay, and that the zero-detector delay should be less than
the sum of the XOR/AND plane delay, the MUX delay and the latching delay of the DETFF.
The changes needed in the original architecture of the PHA to include such a capabil-
ity are shown in Fig. 4.6.
The roll-back needs to be effected every time the adder completes computation.
Instead of re-latching the data from SUM1 to SUM to effect this roll-back, an easier method
is to use the SUM1 register for all external purposes and the SUM register for all the
internal calculations. Thus after the sum has converged, the result is actually available in
the SUM1 register instead of the SUM register in this modified version of the adder.
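The register discipline of the "predict incomplete" strategy can be modelled in a few lines. This is a behavioural sketch only: it assumes the PHA iteration SUM XOR CARRY with a left-shifted SUM AND CARRY, it models the late zero-detect verdict as a flag that becomes visible one clock edge after the CARRY it refers to, and the function name `pha_predict_incomplete` is hypothetical.

```python
def pha_predict_incomplete(a: int, b: int):
    """Latch on every edge; the zero-detect verdict on CARRY arrives one edge late.

    Returns (result read from the shadow register SUM1, number of clock edges).
    """
    s, c = a, b       # SUM and CARRY registers loaded with the operands
    sum1 = None       # shadow register that lags SUM by one edge
    edges = 0
    while True:
        carry_was_zero = (c == 0)     # verdict that resolves during the coming edge
        sum1 = s                      # SUM1 <- SUM
        s, c = s ^ c, (s & c) << 1    # speculative latch of the next iteration
        edges += 1
        if carry_was_zero:
            return sum1, edges        # externally visible result is taken from SUM1


print(pha_predict_incomplete(5, 3))
print(pha_predict_incomplete(0xFFFF, 1))
```

In this model the adder spends one extra edge on the final, mispredicted iteration, but never waits for the zero detector on the earlier ones, and the correct sum is always the value held in SUM1.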
Figure 4.6. Modified Adder to support "predict incomplete"
Using this architecture, it is possible to hide the zero-detect delay almost fully. A
very small part of it will show up, since the constraints that need to be met here for the clock
generation are stricter than in the normal case. Waiting will still be necessary when the
fan-in of the adder is extremely large, as the zero-detect delay in such cases would exceed
the sum of the delay of the XOR/AND plane and the MUXing delay. Using the values
presented in Chapter 4, the sum of the XOR/AND and the MUXing delay adds up to $3\Delta_g$, where
$\Delta_g$ denotes the unit gate delay. The delay of the zero detector is given by $\log_f n \cdot \Delta_g$, where
f is the fan-in of the gates used to implement the zero detector (6 in our implementation) and
n is the number of bits in the adder. Hence the zero-detect delay will be fully hidden
when $\log_f n < 3$, and it will start showing, by an amount equal to $\Delta_g(\log_f n - 3)$, when
$\log_f n > 3$.
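The hiding condition is easy to tabulate. The following sketch expresses the exposed part of the zero-detect delay in units of the gate delay, using the budget of three gate delays and the fan-in of 6 quoted above; the function name and parameter names are hypothetical.

```python
import math


def exposed_zero_detect_delay(n_bits: int, fan_in: int = 6, hidden_budget: float = 3.0) -> float:
    """Part of the zero-detect delay that is not hidden, in units of the gate delay.

    The detector needs log_f(n) gate delays; up to `hidden_budget` of them overlap
    with the XOR/AND plane and the multiplexer.
    """
    return max(0.0, math.log(n_bits, fan_in) - hidden_budget)


for n in (16, 64, 128, 256, 1024):
    print(n, round(exposed_zero_detect_delay(n), 4))
```

With a fan-in of 6 the delay stays fully hidden up to 128 bits, and only a small fraction of a gate delay is exposed at 256 bits.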
In our first cut implementation, the adder is capable of supporting addition and sub-
traction using a MODE control bit. The improvement strategies discussed above have not
been implemented. We propose to improve the design and incorporate these improvements
and implement an adder possessing a higher efficiency.
This chapter and the previous one explained the operation of the adder, evaluated it and
suggested methods to speed it up. What remains to be discussed are the
strategies to implement the adder in silicon and its use in a real application. The next chapter
discusses ways to partition the adder such that it can be implemented in silicon with minimum
effort.
CHAPTER 5. IMPLEMENTATION OF THE PHA
Once the design and the system level simulation of a system are completed, a number
of options exist to implement it in silicon. The solutions range from a full custom imple-
mentation to an implementation synthesized directly from a hardware description language
like VHDL or Verilog®. With the advent of ASIC CAD tools, the latter implementation is
rapidly gaining popularity with VLSI designers. This chapter discusses some of the
implementation options available and describes the methodology we adopted to lay out the PHA.
5.1 Implementation Strategies
Given the available tools, the options that presented themselves were a full custom
layout or the use of a place and route tool, VPNR (Vanilla Place and Route). VPNR is a tool
capable of routing a design captured using VIEWlogic Powerview, provided all the
necessary components are available in a standard cell library. It employs the Magic layout editor
to produce the final design.
A full custom layout allows a number of optimizations and yields the most compact
layout. The disadvantages are that it is time consuming and that any change in the design
after layout has started may be hard to implement and may waste work already
done. Using a place and route tool like VPNR, on the other hand, simplifies the process of
layout considerably and produces a layout which is reasonably dense. A change in the design
at any point in the design cycle can be accommodated. The disadvantage, however, is that
the user is completely at the whim of the tool and has little control over the way the layout
is completed. Only the most recent CAD tools facilitate the control of
critical path behavior by adopting a timing-driven layout methodology. The layout in such cases
is optimized based on the timing criterion rather than on placement issues, as has been done
traditionally. VPNR, being a primitive tool, provided no such capability. Though the
number of rows and columns in the final layout can be selected for an optimal aspect ratio, the
results obtained are far less impressive than those of a full custom layout. This is
understandable, since the problem of placement and routing is NP-complete.
One possible way to improve the final result obtained using a place and route tool
is to partition the design optimally. The design would have to be divided into a number of
blocks and each block routed separately and the blocks placed optimally and routed. Since
VPNR does not support a block routing capability, we have implemented a semi-custom lay-
out in which parts of the layout were generated using VPNR and the rest were laid out by
hand.
5.2 Partitioning the adder
It was emphasized earlier that one of the desired characteristics of a design to be im-
plemented in silicon is modularity. Hence it is necessary to split the adder into modules
which can be easily duplicated so that an adder of any fan-in can be built. The best way to
ensure this is to bit-slice the adder. Fig. 5.1 shows a possible way to split the adder into cells.
Each cell of Fig. 5.1 represents one bit of the adder and the cell needs to be duplicated n+1
times for an n-bit adder.
The control unit generates all the signals, namely, Select, Clock and Mode. The
Carry-in of the current stage is the Carry-out of the previous stage. This takes care of the
left shifting that needs to be implemented to complete each iteration of the PHA. The Carry-
out bits of all such stages need to be connected to a zero detector to determine convergence.
The inset of Fig. 5.1 shows the cell from a high level with all the signals that need to be
cascaded. The external input A is cascaded downwards so that a cell placed diagonally below
the cell can also get the value of External A. Though this feature is not essential for the
implementation of the adder, it is extremely important for implementing arithmetic operations like
division, where the divisor bits need to be cascaded downwards to all the stages.
Figure 5.1. Partitioning the PHA into cells
5.3 The Final Implementation
The cell described in the previous section was implemented full-custom, in the
form of a bitslice. To implement an n-bit adder, n + 1 such bitslices have to be butted
against each other. The (n + 1)th cell is necessary since there is no other way in the algorithm
to determine a carry-out. The control unit is fan-in independent. Since the control unit is
the block most likely to change during a design cycle, it was implemented using VPNR after
capturing the design in VIEWlogic® Powerview. The bitslices were implemented using the
cells from a standard cell library. The zero-detecting NOR gate and the CONTROL unit laid
out using VPNR were all connected together to produce the final layout. The adder that we
have implemented follows Fig. 3.7 and does not take advantage of the speed-up techniques
described in the previous chapter. The simulation results and detailed circuit diagrams
describing the various blocks of the adder are presented in the following sections.
5.4 Circuit diagrams
Figure 5.2 depicts the architecture of the PHA as captured using VIEWdraw®. The
architecture closely follows the one described in Fig. 3.8. The register, Controlled_Rx, is
capable of complementing the input data depending on the MODE bit. The multiplexers are
present inside the controlled registers. The SUM[16:0] bus produces the actual computation
results, and SUM[16] is inverted depending on the MODE register to produce the carry-out.
Mode '0' implements normal addition and Mode '1' implements subtraction. The Shifter
block is just a set of wires shifting the bits once to the left. The least significant bit of the
shifter is determined by the carry select block, which provides control signals to a MUX to
select either a '0' or a '1' in the second cycle to implement a two's complement subtraction.
The expanded views of the Controlled register, the computation plane and the
CONTROL unit are shown in Figures 5.3, 5.4 and 5.5 respectively.
Figure 5.2. The PHA captured using VIEWlogic®
Figure 5.3. The insides of the Controlled Register
Figure 5.4. The computation plane
Figure 5.5. CONTROL unit for the PHA
Figure 5.5 shows the presence of two Muller C-elements in the CONTROL circuit.
These are used along with the Rin and the Ain signals to determine synchronization. Since
the generation of the Ain signal in the figure depends on the Aout signal from the succeeding
block, this is a non pipelined implementation of the interconnect.
As indicated in Section 3.4 subtraction needs the introduction of a '1' in the second
cycle. Since the specific signal names have been introduced earlier, the state diagram and
the circuit required to implement the state diagram are shown in Fig. 5.6.
5.5Simulation Results
The design presented above was simulated using VIEWlogic Viewsim. The same
examples that were simulated using the C program in Sec. 3.2 were chosen. Figures 5.7 and
5.8 present the simulation results for addition.
Figure 5.9 presents an example of subtraction of two numbers. The numbers used
here are the same as those used in Sec. 3.4.
Figures 5.10 and 5.11 present the SPICE simulations of the layout of the CONTROL
unit and one bit of the data path, respectively.
Figure 5.6. Including Subtraction: Carry Select (input: Zero Detector Output; output: MUX_SEL)
Figure 5.7. Simulation of the PHA using VIEWsim (traces: RESET, RIN1, RIN2, AOUT, ROUT, AIN, INPUT1, INPUT2, CLK TEST, SHIFT_INPU, SHIFTED, CARRY, SUM; horizontal axis: time in seconds)
Figure 5.8. Simulation of the PHA (Worst Case) (traces: RESET, RIN1, RIN2, AOUT, ROUT, AIN, INPUT1, INPUT2, CLK TEST, SHIFT_INPU, SHIFTED, CARRY, SUM; horizontal axis: time in seconds)
Figure 5.9. Subtraction using the PHA (traces: RESET, RIN1, RIN2, AOUT, ROUT, AIN, INPUT1, INPUT2, CLK TEST, CARRY; horizontal axis: time in seconds)
Figure 5.10. SPICE simulation of the CONTROL circuit (panel title: SPICE SIMULATION OF THE CONTROL UNIT; traces include Reset and CLK_OUT; horizontal axis: TIME (LIN))
Figure 5.11. SPICE simulation of the bitslice (traces: Clock, Reset, Carry out, Sum; horizontal axis: TIME (LIN))
CHAPTER 6. A SELF-TIMED 16 BIT DIVIDER USING THE PHA
Of the four basic arithmetic operations, division is the most complex and is conse-
quently the most time consuming. Given a dividend X and a divisor Y, the outputs that are
of importance in a division operation are the quotient Q and the remainder R, satisfying the
condition:
$$X = Q \cdot Y + R, \quad \text{where } R < Y \tag{6.1}$$
The asynchronous PHA described in Chapter 3, evaluated in Chapter 4, and implemented,
improved and simulated in Chapter 5 can be used in the construction of a self-timed array
divider. On average, the divider would produce a quotient bit after the average computation
time of the n-bit adder. This chapter presents the design of an array divider using cells
from the PHA.
6.1 Array Division
All algorithms for division can be implemented using an array of cells where each
step of the algorithm is executed by a separate row of cells [15]. Thus, n rows of cells with
n cells per row will be needed to implement a radix-2 division algorithm. Division can be
performed using either a restoring or a non-restoring algorithm. In restoring division, the
difference between the previous partial remainder and the divisor is formed and the quotient
bit is determined from the sign of the result. Should the sign prove negative, the partial
remainder is restored. This restoring step is unnecessary: the same result can be obtained
by repeatedly shifting and choosing an addition or a subtraction depending on the sign of
the previous result. The advantage of using a non-restoring array is the simplicity
with which it handles negative partial remainders. Figure 6.1 depicts a 16-bit array divider.
Figure 6.1. A 16-bit array divider (inputs labelled Y0 to Y15 and X0 to X31; outputs include the remainder bits)
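The non-restoring scheme described above can be sketched in software as follows. This is the textbook radix-2 algorithm for unsigned operands, not a model of the array circuit itself: each loop iteration plays the part of one row of the array, and the function name `non_restoring_divide` and the parameter `n` are illustrative assumptions. The dividend is taken to be 2n bits wide, as in the 32-bit by 16-bit case discussed here.

```python
def non_restoring_divide(x, y, n=16):
    """Unsigned radix-2 non-restoring division (a textbook sketch).

    x: dividend of up to 2n bits, y: n-bit divisor.
    Requires x < y * 2**n so that the quotient fits in n bits.
    """
    if y <= 0 or not (0 <= x < (y << n)):
        raise ValueError("need y > 0 and 0 <= x < y * 2**n")

    r = x >> n  # upper half of the dividend is the initial partial remainder
    q = 0
    for i in range(n - 1, -1, -1):      # one iteration per row of the array
        bit = (x >> i) & 1              # next dividend bit consumed by this row
        if r >= 0:
            r = 2 * r + bit - y         # previous result non-negative: subtract
        else:
            r = 2 * r + bit + y         # previous result negative: add, no restore
        q = (q << 1) | (1 if r >= 0 else 0)

    if r < 0:
        r += y                          # single final correction of the remainder
    return q, r


print(non_restoring_divide(100, 7, n=8))  # (14, 2)
```

Only the sign of the previous partial remainder is needed to choose between addition and subtraction, which is why a row built from adder cells with a mode input suffices.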
Each cell of the divider is one cell of the PHA. The blocks labelled R are simple
double-edge-triggered flip-flops, which facilitate pipelining. They are necessary because one
of the operands is 32 bits long and only the first 16 bits are used in the first row. The other
bits are consumed one by one as the execution proceeds downwards along the rows. Once an
Ain signal is issued to the preceding block, the data on this 32-bit input can be changed at
any time by the preceding block, as the receipt of data has been acknowledged. This could
potentially result in a loss of data. On the other hand, withholding the Ain signal from the
previous block until the process is complete would result in a non-pipelined implementation,
drastically reducing the utilization of the rows and hence the throughput of the system.
The CONTROL blocks form the left periphery of Fig. 6.1 and generate the internal
clocking and synchronizing signals.
6.2 The design of a 16-Bit Divider
As described in the previous section, a cellular construction of the adder can be used
to implement an array divider. Since our implementation of the adder in silicon is in the form
of bit-slices and the control unit is also available, it is relatively straightforward to
implement an array divider. To capture the design of the divider and perform a system-level
simulation, we have used the adder captured in Chapter 5 directly, without any alteration.
Hence, each row of the array divider in our design looks like a single block. Any attempt to
implement the divider in silicon with the cells and bit-slices that we have developed would
follow Fig. 6.1 more closely.
Figure 6.2 depicts the design capture of the divider. The pipelined nature of the divider
is clearly seen from Fig. 6.2, with all the PHAs connected to one another and data flowing
through each stage of the pipeline. Also noticeable are the absence of a global clock and
the locally generated control signals connecting one block to another. Each PHA is a stage of
the pipeline, and there are 16 such stages.
Figure 6.2. Design capture of the 16-bit divider (2-cycle micropipeline: 16 stages of computation and interconnect)
A peculiar problem has been noticed in simulating the divider captured above. From
the circuit diagram for the CONTROL unit of Fig. 5.5, it can be seen that a RESET should
introduce a '0' at the clock output line. The rest of the actions taking place in the CONTROL
unit depend on the circulation of this initial '0' token. It can also be seen that we
introduced this '0' by using a redundant AND gate at the clock output and connecting it to the
RESET line. The event-driven nature of the simulator requires that the RESET signal
be the one that changes last. If this condition is not met, the system goes into an
unknown state. We ensured this in the simulation of the adder by writing the simulation
impulses with this in mind. However, in simulating the full divider, this condition is
impossible to ensure using the simulation impulse file. Since the Rin1 inputs of all the
stages except the first are undetermined until the previous stage produces a Rout, the global
RESET signal is the first to change for all the PHAs except the first. Thus, even before all
the values of Rin are available, the RESET changes and the whole system goes into an unknown
state. However, we have verified the functionality of the design by splitting the reset line
into many parts (RESET1, RESET2 and so on) and changing the value on each RESET line only
after all the other changes in the circuit were stable. Since it is impractical to split the
RESET line into 16 different lines, we verified the functionality of the first few stages
only, and found the system to be functionally correct. The SPICE simulation of the CONTROL
unit in Fig. 5.10 shows that, in practice, any signal can change first and the system is
still functional, so the problem noticed here with the RESET line is due only to the
simulator and not to the functionality of the circuit itself. Thus the divider works, even
though we are unable to provide any formal simulation results here.
Division is the most time-consuming of the four basic arithmetic operations. This is
because the computation of any row in the division array depends on the results produced in
the previous row. Hence an array divider is used for the sole purpose of its regularity and
not for any special speed-up. Since the array division process consumes large quantities of
hardware, the usual scheme employed for carry propagation is the Ripple Carry Scheme.
Schemes like CLA are very rarely used to speed up the carry propagation, owing to their
prohibitive hardware consumption.
The PHA offers an effective solution to this problem by speeding up the addition process
while causing only an extremely small hardware increase. Thus the PHA is ideally suited for
operations like array division.
CHAPTER 7. CONCLUSIONS AND FUTURE WORK
The preceding chapters of this thesis presented the design and development of a Par-
allel Half Adder and its application in a non-restoring array divider. This chapter summa-
rizes the work done and the conclusions. Directions for future research in this area have also
been provided.
6.3 Conclusions
We have designed and implemented a Parallel Half Adder in silicon. The adder has
been designed for fabrication using a 1.2 µm process.
Since iteration in time is used, the hardware complexity of the PHA approaches that
of a simple CRA. The PHA compares reasonably well with a CLA in terms of speed and proves
to be faster than a CCA and a CRA. In terms of overall efficiency, the CSA proves to be the
most efficient adder. The CLA is placed next and is closely followed by the PHA.
The number of iterations necessary for the PHA to attain convergence depends on the
input data patterns. Modelling such a system synchronously would be extremely difficult, as
the number of clock cycles necessary for convergence varies each time. Thus the PHA
represents a case where an asynchronous algorithm is mapped to an asynchronous system.
Since the results obtained in terms of speed are comparable to existing adders and the
hardware complexity is lower, the PHA is a viable alternative to existing adders for
arithmetic in self-timed circuits. It presents a reasonable balance between hardware
complexity and speed, making it a good choice for designs where some speed is necessary but
the area on-chip is not sufficient to incorporate a very high speed adder design.
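The data dependence of the convergence time can be illustrated with a small software model. The sketch assumes the usual iterative half-adder formulation (sum = XOR, carry = AND shifted left, repeated until no carries remain) and counts the passes through the half-adder row; the function name `pha_iterations` and the chosen operands are illustrative and are not the examples simulated in Chapter 5.

```python
def pha_iterations(a, b, width=16):
    """Return (sum, number of half-adder passes) for a width-bit addition (a sketch)."""
    mask = (1 << width) - 1
    total = (a ^ b) & mask
    carry = ((a & b) << 1) & mask
    passes = 1
    while carry:  # repeat until the carry vector is all zeros
        total, carry = total ^ carry, ((total & carry) << 1) & mask
        passes += 1
    return total, passes


# Operands with no overlapping 1 bits converge in a single pass.
print(pha_iterations(0x00FF, 0x0F00))  # (4095, 1)
# A carry rippling through every bit position needs one pass per bit.
print(pha_iterations(0xFFFF, 0x0001))  # (0, 16)
```

A synchronous design would have to budget for the longest case on every addition, whereas a self-timed one can signal completion as soon as the carry vector is empty.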
The adder is particularly attractive for array division, since the result of each row of
the array divider is used for processing by the row immediately below. Thus the speed with
which the addition or subtraction in each row is performed becomes the limiting factor for
the speed of the divider. Since the hardware complexity of the divider is already high,
methods to speed up carry propagation like CLA are not viable. Such a dependency does not
exist in other arithmetic operations like multiplication, where all the partial products can
be generated in parallel.
The overhead incurred due to handshaking can be more than offset if the algorithm
is partitioned appropriately and the thread chosen to process between synchronization points
is sufficiently large. This is demonstrated by the pipelined implementation of the PHA,
where the cost of synchronization is a small fraction of the computation time. Thus, if a
system is appropriately partitioned, it is possible to utilize the host of advantages brought
out by self-timing and not feel the cost paid for handshaking.
6.4 Future Work
Our implementation of the PHA uses the simplest architecture and does not take ad-
vantage of the possible speed up techniques proposed. Future work should be directed in
this area to improve the adder.
The adder has been characterized here only partially. Attempts to find the probability
of occurrence of the worst case and a more detailed study of the carry propagation patterns
should be undertaken and accurate mathematical models need to be developed.
As a result of implementing this project, we have updated a standard cell library to
incorporate some of the basic primitives like Muller C-elements and double-edge-triggered
flip-flops. A partial effort to provide entries pointing to such a library in a design capture
tool has also been made. The library available as of now can be used only for design capture
and auto routing, and cannot be used for simulation. VHDL models need to be attached to
the components in this library so that they can be used for design capture and simulation as
well. This is an important step, as a place and route tool is needed to complete full synthesis
of a design, and ongoing research in our group is directed at synthesis of self-timed circuits.
Another direction for active research should be the design of an interconnect to hook
up an asynchronous building block with a synchronous system. Such a hybrid architecture
can potentially utilize the advantages of both synchronous and asynchronous systems.
Since self-timing proves to be well suited for DSP, the use of this adder in DSP algo-
rithms should be studied. We believe that the adder could be best suited for such applications
since DSP IC chips are usually core limited, and area is usually a limiting factor prohibiting
costly investments like CLA adders. Since our adder provides a balance between speed and
area, its use in DSP chips could prove beneficial.
BIBLIOGRAPHY
[1] I. E. Sutherland, "Micropipelines," Communications of the ACM, Vol. 32, No. 6,
pp. 720-738, June 1989.
[2] D. Pountain, "Computing Without Clocks," BYTE, pp. 145-150, January 1993.
[3] W. Wolf, "Modern VLSI Design: A System Approach," Prentice Hall, pp. 108-112, 1994.
[4] N. Weste and K. Eshraghian, "Principles of CMOS VLSI Design, A Systems Perspective,"
Addison Wesley Publishing Company, 1993.
[5] E. J. McCluskey, "Logic Design Principles," Prentice Hall, pp. 82-84, 1986.
[6] D. B. Armstrong, A. D. Friedman and P. R. Menon, "Design of Asynchronous Circuits
Assuming Unbounded Gate Delays," IEEE Transactions on Computers, C-18, Dec. 1969.
[7] F. U. Rosenberger, C. E. Molnar, T. J. Chaney and T. P. Fang, "Q-Modules: Internally
Clocked Delay-Insensitive Modules," Technical Report 4706, 4707, 4708, Sutherland,
Sproull and Associates, 1986.
[8] T. H. Meng, "Synchronization Design for Digital Systems," Kluwer Academic
Publishers, 1991, Ch. 2, pp. 35.
[9] S. L. Lu, "Self-timed Arithmetic Structures in CMOS Differential Logic," Ph.D. Thesis,
Computer Science Department, UCLA, 1991.
[10] L. G. Heller et al., "Cascode Voltage Switch Logic: A Differential CMOS Logic Family,"
ISSCC Digest of Tech. Papers, 1984, pp. 16-17.
[11] L. Merani, "Micro Dataflow A Dataflow Approach to Self-timing," Master's Thesis,
Electrical and Computer Engineering Department, OSU, 1993.
[12] S. L. Lu, "Implementation of Micropipelines in ECDL," submitted to IEEE Trans. on
VLSI, 1993.
[13] G. M. Jacobs and R. W. Brodersen, "Self-Timed Integrated Circuits for Digital Signal
Processing Applications," VLSI Signal Processing III, IEEE Press, 1988.
[14] S. L. Lu and M. Ercegovac, "A Novel CMOS Implementation of Double-Edge-Triggered
Flip-Flops," IEEE Journal of Solid-State Circuits, Vol. 25, No. 4, August 1990,
pp. 1008-1010.
[15] I. Koren, "Computer Arithmetic Algorithms," Prentice Hall, 1993, Ch. 5, pp. 71-92.
[16] K. Hwang, "Computer Arithmetic," John Wiley and Sons, 1979, Ch. 3, pp. 69-93.
[17] J. Sklansky, "Conditional Sum Addition Logic," IRE Trans. EC-9, No. 2, June 1960,
pp. 226-231.
[18] J. Sklansky, "An Evaluation of Several Two-Summand Binary Adders," IRE Trans. EC-9,
No. 2, June 1960, pp. 213-226.
[19] A. Burks, H. H. Goldstine and J. Von Neumann, "Preliminary Discussion of the Logic
Design of an Electronic Computing Instrument," Instit. Adv. Study, Princeton, NJ, 1946.
[20] D. B. Swink, "Parallel Half-adder Addition," private communication, April 1993.
[21] J. D. Ullman, "Computational Aspects of VLSI," Principles of Computer Science Series,
Ch. 2, 1984.