An 180 MHz 16 bit multiplier using asynchronous logic design techniques by Burford, Richard G et al.
An 180 MHz 16 bit Multiplier Using Asynchronous Logic Design
Techniques
Richard G. Burford, Xingcha Fan and Neil W. Bergmann
CSIRO/Flinders Joint Research Centre in Information Technology
Flinders University, Adelaide, AUSTRALIA 
Abstract 
A CMOS digital logic design technique is described which
exploits the advantages of fast precharged logic and efficient
latch design commonly used in synchronous systems while
maintaining the features of localized control inherent in
asynchronous design. A pipelined sixteen bit multiplier
is presented and its performance compared with several
previously reported asynchronous and synchronous designs.
1 Introduction 
Arithmetic operators are often the major building blocks and 
performance limiting factor for application specific Digital 
Signal Processing (DSP) and other numerical data process- 
ing hardware. It is generally believed that asynchronous 
arithmetic operators are slower and occupy a larger chip area
than their synchronous counterparts and this is supported
by several published studies [9]. We contend, however, that
with careful selection and design of control and data paths,
asynchronous operators can be designed with performance
equal to that of equivalent synchronous ones. When a
degradation factor (typically 50 % or greater [2]) is applied
to the clock speed to allow for temperature, supply voltage
and process variations, asynchronous designs can exhibit
a significant performance advantage. Here we present the
design of a sixteen bit fixed point parallel multiplier which 
achieves a performance level similar to that of non-derated 
synchronous designs. 
2 Asynchronous design techniques 
Asynchronous digital logic design removes the global tim- 
ing constraints of a clocked system. The flow of data is 
dictated by local timing considerations. This attribute is 
becoming increasingly important as feature size is reduced 
and chip complexity increases. Other potential advantages 
of asynchronous design include lower power consump- 
tion, simplified system level design, and greater product 
longevity. There are many approaches to the design of 
asynchronous systems; Hauck in [4] provides an excellent
summary. Here we briefly review some methods used for 
reported multiplier designs which we will use for compari- 
son. 
An example of the bounded delay technique is the
micropipeline [10]. Each bit of data is represented as a one bit
variable carried over a single wire. Arbitrary data widths are
bundled and the flow of data is controlled by the exchange
of common request and acknowledge handshake signals.
The computational delay in the data path is accounted for
by introducing explicit delay elements in the control path.
This technique of delay modelling must allow for the worst
case delay of the data path. Two-cycle handshaking is used,
i.e. a transition (either low-to-high or high-to-low) is used to
signal an event. In order to use dynamic precharged logic,
a pre-charge period between successive evaluation phases
is required. Since two-cycle signalling does not have a
return-to-zero phase which can be used for the pre-charge
signal, it is difficult to incorporate precharged logic into a
micropipeline and so computation within the pipeline is
usually implemented in static complementary logic (CMOS).
This results in larger and slower logic cells than achievable
using dynamic precharged logic because of the need for
more complementary P type transistors.
The multiplier presented by Meng in [5] is an example of a
speed-independent design utilizing precharged Differential
Cascode Voltage Switch Logic (DCVSL) for computation.
Data is dual rail encoded, i.e. two wires are used to
encode each data bit. Inverters are necessary between
succeeding computational stages to ensure that evaluation
can commence only when the outputs of the preceding stage
have settled. This introduces an inverter delay between
each stage. The DCVSL provides complementary outputs
which can be used for completion detection. Completion
detection can produce a performance improvement when
there is a data dependent variation in computation time. An
example is a ripple carry adder where a completion signal
can be generated when the carry propagation between stages
is complete [3]. However, the time and area overhead for
completion detection may outweigh its advantages.
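The dual rail encoding and completion detection described above can be illustrated with a small behavioural sketch. It is not taken from Meng's design: the function names are hypothetical, and the choice of both rails low as the precharged state is an assumption made only for illustration.

```python
# Behavioural sketch of dual rail encoding with completion detection.
# Assumption: the precharged (no data yet) state drives both rails low.
PRECHARGED = (0, 0)

def encode_dual_rail(bit):
    """Return the (true_rail, false_rail) pair for one valid data bit."""
    return (1, 0) if bit else (0, 1)

def is_complete(word):
    """A word is complete once every bit has exactly one rail asserted."""
    return all(t != f for t, f in word)

# A 4-bit result whose bits settle one at a time, as in a ripple carry adder.
word = [PRECHARGED] * 4
for position, bit in enumerate([1, 0, 1, 1]):
    word[position] = encode_dual_rail(bit)
    print(position, is_complete(word))
```

The completion signal only becomes true after the last bit has left the precharged state, which is why it can track data dependent delay, at the cost of the extra detection logic.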
The Latched Differential Pass Transistor Logic (LDPL)
design used by Salomon et al [8] is another dual rail
technique which provides high speed with low power
consumption. A feature of this method is the incorporation of
data storage within the output buffer of the logic stage. This
eliminates the need for data latching registers in pipelined
designs.
IEEE 1994 CUSTOM INTEGRATED CIRCUITS CONFERENCE
0-7803-1886-2/94 $3.00 © 1994 IEEE
Figure 1: Multiplier floor plan
We have chosen to use a bundled data path with four- 
cycle handshaking and a delay model in the control path. 
The four-cycle signalling protocol provides a return-to-zero 
phase which can be used with precharged logic. 
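As a rough illustration of why four-cycle signalling suits precharged logic, the sketch below steps through one transfer. The mapping of the return-to-zero half of the handshake onto the precharge phase is an assumption for illustration; the function name and the tuple layout are hypothetical and are not taken from the design.

```python
# Sketch of one four-cycle (return-to-zero) handshake.
# Assumption: logic evaluates while request is high and precharges
# once request has returned to zero.
def four_cycle_handshake():
    """Yield (request, acknowledge, logic_phase) for one data transfer."""
    yield (1, 0, "evaluate")   # sender raises request: data is valid
    yield (1, 1, "evaluate")   # receiver raises acknowledge: data captured
    yield (0, 1, "precharge")  # request returns to zero
    yield (0, 0, "precharge")  # acknowledge returns to zero: ready again

for request, acknowledge, phase in four_cycle_handshake():
    print(f"req={request} ack={acknowledge} phase={phase}")
```

A two-cycle protocol would finish the transfer after the first two steps, leaving no idle half in which to precharge.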
3 Overall structure
The multiplier uses sixteen stages of sixteen bit carry save
adders (CSA) and a combination of Manchester carry adders
and carry select adders to calculate the final thirty two bit
result. It is implemented as a five stage pipeline. The
five stage pipeline is the result of a trade off between
throughput, latency, and chip area. Increasing the number
of stages of pipelining increases the number of interstage
registers. This results in greater throughput at the expense
of a longer latency time and larger chip area. Five stages
of pipelining maximize throughput while not exceeding
the allowable active die area (excluding pads) of +.24
mm² for the chosen prototype fabrication process (Orbit
Semiconductor tiny chip) [7].
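The arithmetic performed by the carry save stages can be sketched at the word level as follows. This is a simplified software model, not the circuit: integers stand in for bit vectors, the final carry propagating addition is done with Python's `+` rather than Manchester and carry select adders, and the function names are hypothetical.

```python
def csa(a, b, c):
    """Carry save adder: reduce three operands to a sum word and a carry word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word

def multiply_csa(x, y, width=16):
    """Multiply two unsigned width-bit operands using one CSA row per bit of y."""
    sum_word, carry_word = 0, 0
    for i in range(width):
        # Partial product for bit i of y, shifted into position.
        partial = (x << i) if (y >> i) & 1 else 0
        sum_word, carry_word = csa(sum_word, carry_word, partial)
    # Final carry propagating addition resolves the redundant form.
    return sum_word + carry_word

print(multiply_csa(0xFFFF, 0xFFFF) == 0xFFFF * 0xFFFF)  # True
```

No carry ripples along a row inside the loop; carry propagation is deferred to the single final addition, which is what keeps each carry save stage fast.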
The floorplan of the multiplier is shown in Figure 1.
The X operand is input at the top of the array, while Y is
fed in from the right. The array of carry save adders and
the final Manchester adder stages occupy most of the chip
area. Pipeline register stages for Y input and lower order
outputs are on the right, while control circuitry occupies
a vertical strip between the input pipeline and the adder
array. Precharged logic with a pull down evaluation tree
of n-channel MOSFETs is used for all computational logic
blocks.
4 Progressive evaluation 
To ensure that the NMOS pull down tree evaluates correctly
it is essential that evaluation of a given stage does not
commence until all its inputs are correct and stable. One
technique to ensure this is to insert register stages between
each logic block; this, however, increases latency by adding
register latching time. Another method is to interpose
inverters between each evaluation stage, which increases
stage delay.
Figure 2: Control path for one pipeline stage
Figure 3: Latch and precharge signals for one pipeline stage
Our design is a five stage pipeline using handshaking
elements to generate local latching signals for the pipeline
registers. Since the design is asynchronous, the precharge
and evaluation phases of adjacent logic blocks are not
constrained by a global clock system. Within each pipeline
stage, adjacent evaluation stages are released from precharge
after the output of the preceding stage has settled. The term
we use for this technique is Progressive Evaluation (PE).
This technique is similar to a multi-phase synchronously
clocked precharged logic system. Timing for the precharge
and evaluation phases is derived from taps in the delay model
for the overall computational stage as shown in Figure 2.
The string of buffers is a delay line which models the delay
of the adder stages. Muller C-elements [6] co-ordinate data
transfer between pipeline stages. When an input request is
generated, and the stage is empty, latch signal (LO) goes
high, latching the data into the input register. The output
of the C-element is then driven high, starting the evaluation
of the first CSA (signal P1 changes from pre-charge state to
evaluate). The signal progresses through the inverter chain
causing signals P2 - P4 to go high in succession. Finally a
latching signal is generated for the next pipeline register and
all evaluation stages are returned to the precharged state. A
SPICE simulation for one pipeline stage of four carry save
adders is shown in Figure 3.
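The control behaviour described above can be approximated in software. The sketch below models a two-input Muller C-element and the staggered release of the precharge signals from delay line taps. The class and function names are hypothetical, and a uniform tap spacing is an assumption; the real taps are set by the buffer chain of Figure 2.

```python
class CElement:
    """Two-input Muller C-element: output follows the inputs when they
    agree and holds its previous value when they differ."""

    def __init__(self, state=0):
        self.state = state

    def update(self, a, b):
        if a == b:
            self.state = a
        return self.state

def release_times(n_stages, tap_delay_ns):
    """Times at which P1..Pn leave precharge, assuming evenly spaced taps."""
    return [i * tap_delay_ns for i in range(n_stages)]

c = CElement()
print(c.update(1, 0))  # 0: inputs disagree, output holds
print(c.update(1, 1))  # 1: both inputs high, output rises
print(c.update(0, 1))  # 1: inputs disagree, output holds
print(c.update(0, 0))  # 0: both inputs low, output falls

for name, t in zip(["P1", "P2", "P3", "P4"], release_times(4, 0.8)):
    print(f"{name} released at {t:.1f} ns")
```

Each later signal is released only after the preceding stage has had one tap delay to settle, which is the essence of Progressive Evaluation.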
5 Circuit elements 
5.1 Carry save adder 
The circuit used for the carry save adder stage is shown
in Figure 4. Complementary inputs and outputs are used
to eliminate the need for inverter stages, thus minimizing
computation time. Simulations using normal transistor
process models show that both sum and carry outputs which
evaluate to a low state reach 50 % of rail voltage in 0.6 ns and
25 % within 0.8 ns. Because no pull-up transistors are active
during the evaluation phase, the evaluation of the following
stage can be commenced once any low level inputs have
settled below the threshold of the NMOS evaluation tree.
Allowing 0.8 ns between successive evaluations gives a
safety margin of 50 % with a typical half rail threshold.
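The 50 % margin quoted above follows from the simulated levels if the margin is taken as the distance below the threshold relative to the threshold itself. That definition is an assumption made for this sketch, and the function name is hypothetical.

```python
def safety_margin(level, threshold):
    """Fraction by which a falling signal has passed below the threshold.
    Both arguments are fractions of the rail voltage."""
    return (threshold - level) / threshold

# After 0.8 ns a falling output is at 25 % of rail; the threshold is half rail.
print(safety_margin(level=0.25, threshold=0.50))  # 0.5
```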
5.2 Manchester carry adder 
The final pipeline stage of the multiplier resolves the high
order sixteen bits of the result. This is achieved using a
combination of Manchester carry adders and carry select adders
[11]. The result is evaluated and latched in approximately 5
ns, which matches the time taken for the previous pipeline
stages.
5.3 Pipeline registers 
The pipeline registers for X and Y inputs and output results
use nine transistor single-phase positive edge triggered D
flip flops [1]. The circuit, shown in Figure 5(a), can be
implemented with fewer transistors than micropipeline style
transition registers [10] and exhibits near zero data hold
time. An input inverter is used to give a non inverting
latch, resulting in eleven transistors for each latch element.
Simulations indicate that, when implemented in the 1.2 micron
CMOS process used for our design, the latch has a setup
time of 0.8 ns and total delay of 1 ns under normal operating
conditions.
Carry and sum outputs are latched using the circuit shown
in Figure 5(b). This is a latched form of a Muller C-element.
When the latch signal (L) is high and D is not equal to -D,
the output will equal D. When D equals -D (i.e. when
precharged) the output will be held. When L is low the
output will also be held. When used with the latching
signals shown in Figure 3, the latch is enabled while the
latch signal is high and assumes the correct value when
the precharged signals evaluate. Simulation results show a
delay of 0.Y ns from input to output.
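The three rules stated above for the C-element latch translate directly into a small behavioural model. This sketch captures only the logical behaviour, not the transistor circuit of Figure 5(b); the class name and the argument name `d_bar` (standing for -D) are hypothetical.

```python
class LatchedCElement:
    """Behavioural model of the latched C-element used for sum and carry."""

    def __init__(self, q=0):
        self.q = q

    def update(self, latch, d, d_bar):
        # The output follows D only when the latch is enabled and the
        # two rails differ, i.e. the precharged inputs have evaluated.
        if latch and d != d_bar:
            self.q = d
        # Otherwise (latch low, or rails equal while precharged) hold.
        return self.q

cell = LatchedCElement()
print(cell.update(latch=1, d=1, d_bar=1))  # 0: rails equal, output held
print(cell.update(latch=1, d=1, d_bar=0))  # 1: evaluated, output follows D
print(cell.update(latch=1, d=1, d_bar=1))  # 1: precharged again, held
print(cell.update(latch=0, d=0, d_bar=1))  # 1: latch low, held
```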
6 Performance 
SPICE simulations using normal transistor parameter models
[7] at an operating temperature of 27 degrees Celsius
indicate the multiplier is capable of accepting input operands
every 5.46 ns, an effective throughput rate of 183 MHz.
Latency is 25 ns when the pipeline is full, but because of the
elastic nature of the pipeline this is reduced to 19.6 ns for
an empty pipeline. Table 1 compares the performance of
several multiplier designs with this design: the DCVSL
multiplier reported by Meng [5], and the LDPL, micropipeline
and synchronous designs from Salomon et al [8]. The entries
shown for Area(scaled) have been scaled to 1 micron feature
size for comparison purposes. Note that the DCVSL design
is not pipelined. A synchronous design using techniques
similar to those presented here (including Progressive
Evaluation) should be able to achieve performance equal to or
perhaps slightly better than our design (without allowing
for a degradation factor). Such a design would not however
have the elastic pipeline properties.
Figure 4: Carry save adder circuit
Figure 5: (a) Pipeline register latch (b) C-element latch
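The throughput figure and the scaled areas can be cross-checked with simple arithmetic. The sketch below assumes that area scales with the square of the feature size, which is consistent with the Area(scaled) entries reported in Table 1 but is not stated explicitly in the text; the function names are hypothetical.

```python
def throughput_mhz(cycle_ns):
    """Operand acceptance rate in MHz for a given cycle time in ns."""
    return 1e3 / cycle_ns

def scaled_area(area_mm2, feature_um, target_um=1.0):
    """Scale an area to a target feature size, assuming quadratic scaling."""
    return area_mm2 * (target_um / feature_um) ** 2

print(round(throughput_mhz(5.46)))        # 183 (MHz)
print(round(scaled_area(8.1, 1.6), 2))    # 3.16 (DCVSL, from 1.6 micron)
print(round(scaled_area(3.03, 1.2), 1))   # 2.1  (this design, from 1.2 micron)
```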
7 Conclusions 
A multiplier design which offers the advantages inherent in 
asynchronous design without sacrificing performance when
compared to other approaches has been presented. The 
design is part of a toolkit of asynchronous logic building 
blocks for DSP being developed by the CSIRO/Flinders 
Joint Research Centre. 
The multiplier has been fully designed and simulated and
is about to be submitted for fabrication in a 1.2 micron
CMOS process. The layout is shown in Figure 6.
Operational results should be available for presentation at the
conference.
Technique             | DCVSL | LDPL | Mpipe. | Sync. | PE
Feature size (micron) | 1.6   | 1.0  | 1.0    | 1.0   | 1.2
Area (mm²)            | 8.1   | 2.59 | 2.64   | 2.53  | 3.03
Area(scaled) (mm²)    | 3.16  | 2.59 | 2.64   | 2.53  | 2.1
Speed (MHz)           | 28.6  | 146  | 104    | 172   | 180
Table 1: Performance comparison of several multipliers
Figure 6: Mask layout of the multiplier
8 Acknowledgements 
The support of the Australian Research Council and the 
Australian Telecommunications and Electronics Research 
Board is acknowledged.
References 
[1] M. Afghahi and C. Svensson. A Unified Single-Phase
Clocking Scheme for VLSI Systems. IEEE Journal of
Solid-State Circuits, 25(1):225-231, February 1990.
[2] M.E. Dean. Strip: A Self-Timed RISC Processor.
Technical Report CSL-TR-92-543, Computer Systems
Laboratory, Stanford University, July 1992.
[3] J.D. Garside. A CMOS VLSI Implementation of an
Asynchronous ALU. In IFIP Working Conference
on Asynchronous Design Methodologies, Manchester,
England, April 1993.
[4] S. Hauck. Asynchronous Design Methodologies: An
Overview. Technical Report 93-05-07, Department
of Computer Science and Engineering, University of
Washington, Seattle WA, 1993.
[5] T.H-Y. Meng. Synchronization Design for Digital
Systems. Kluwer Academic Publishers, 1991. ISBN
0-7923-9128.
[6] D.E. Muller. Asynchronous Logics and Application to
Information Processing. Switching Theory in Space
Technology. Stanford University Press, 1963.
[7] Orbit Semiconductor Inc, Sunnyvale, CA. Foresight
Users Manual, rev 1.4, July 1991.
[8] O. Salomon and H. Klar. Self-Timed Fully Pipelined
Multipliers. In IFIP Working Conference on
Asynchronous Design Methodologies, Manchester,
England, April 1993.
[9] J. Sparsø, C.D. Nielsen, L.S. Nielsen, and J. Staunstrup.
Design of Self-Timed Multipliers: A Comparison. In
IFIP Working Conference on Asynchronous Design
Methodologies, Manchester, England, April 1993.
[10] I.E. Sutherland. Micropipelines. Communications of
the ACM, 32(6):720-738, June 1989. 1988 Turing
Award Lecture.
[11] N. Weste and K. Eshraghian. Principles of CMOS
VLSI Design. Addison Wesley, 198%.
