Asynchronous multipliers with variable-delay counters by Cornetta, Gianluca & Cortadella, Jordi
Asynchronous Multipliers with Variable-Delay Counters * 
Gianluca Cornetta 
Computer Architecture Dept. 
Universitat Politècnica de Catalunya
08034 Barcelona-Spain
E-mail: cornetta@ac.upc.es
Abstract 
Although multiplication is an intensely studied arithmetic
operation and many fast algorithms and implementations are
available, it still represents one of the major bottlenecks of many
digital systems that require intensive and fast computations. This
paper presents a novel design approach based on the well-known
Baugh and Wooley algorithm that is particularly appealing for
asynchronous implementations and may be easily mapped onto a
VLSI circuit. This technique has been applied to the design of
a high-speed variable-delay multiplier that proved to be faster
than other synchronous and asynchronous implementations.
1 Introduction 
Multiplication is an intrinsically slow operation since a large
number of partial products have to be added in order to obtain the
final result. The most common techniques to speed up multiplication
aim at reducing the number of partial products to decrease the
execution time. This may be achieved by encoding the multiplier [4],
or by using parallel counters [10]. The algorithm proposed in this
paper has an implementation inspired by [7] but, unlike [7], it has
a data-dependent execution time.
The idea of datapaths with variable execution time has already
been applied to the realization of several arithmetic operations
such as division and square root [18, 9], addition [11, 15]
and multiplication [13], while a general method for synthesizing
variable-delay pipelined datapaths has been described in [3]. For
example, the multiplier described in [13] is a bidimensional array
of full adders with two possible delays. The delay of a row of the
array is determined by the corresponding bit of the multiplier.
This design approach limits the implementation to radix-2 arrays.
The data-dependent computation times are determined by a
speculation function as proposed in [9]. However, in this case the
algorithm does not need to be retimed in order to perform
speculation and error detection and correction in parallel. In
addition, unlike [13], the architecture described in this paper does
not need an adder at the last stage, since digits are generated
in redundant form starting from the most significant ones and
conversion into non-redundant form may be performed on-the-fly [14].
Synchronization is achieved by means of a dual-rail encoding
of the data bits [6]. This implies the use of differential logic. We
choose to implement the basic cells using CPL gates [2]. The use
of complementary pass-transistor gates is particularly appealing
for low-power applications. Moreover, CPL can be faster than
conventional CMOS logic. Nevertheless, the reduced output voltage
swing requires the use of buffers to obtain a full-swing output
voltage. In addition, the complementary output necessary to
implement the dual-rail protocol requires extra transistors and leads
to a larger area occupation when compared with standard CMOS.
*This work has been partially supported by the Ministry of Education of Spain
under CICYT, TIC 98-0410, by ACiD-WG (ESPRIT-21949) and by Intel Corporation.
0-7803-7057-0/01/$10.00 ©2001 IEEE.
Jordi Cortadella
Software Dept.
Universitat Politècnica de Catalunya
08034 Barcelona-Spain
E-mail: jordic@lsi.upc.es
Like [7], the proposed multiplication scheme is based on a
triangular array of counters; here, however, the counters have a
variable delay. The variable execution time is obtained by making
the computation data-dependent. Data dependency is obtained by means
of a speculation function that tries to predict the result by
assimilating a reduced number of input bits. This leads to a faster
execution time. However, since the output of a counter is a
speculation, it may be wrong and a correction of the result may be
necessary. What distinguishes the proposed approach from similar
implementations is that prediction, error detection and correction
run in parallel; since the error-detection function is faster than
the speculation function, the correction phase can be activated
before the speculated value is issued. As a consequence, in case of
a prediction error the correction logic does not introduce any
overhead in the execution time. This leads to high speedups compared
to other implementations.
The rest of the paper is organized as follows: in Section 2 
we describe the implementation of a standard multiplier based on 
parallel counters and outline the differences between this standard 
implementation and the proposed multiplication scheme. Sec- 
tion 3 deals with the implementation details. Section 4 compares 
our design with other synchronous and asynchronous designs. 
Finally, in Section 5 we draw some conclusions.
2 Design of Array-Multipliers Based on Parallel 
Counters 
In this section we deal with a multiplication scheme based on
parallel counters. The use of counters or compressors decreases
the execution time of a multiplication since the overall
number of partial products to be added is reduced. We first
describe the general architecture of an array multiplier based on
parallel counters and implementing the Baugh-Wooley algorithm.
Next we deal with the design approach we propose to achieve
data-dependent computation times. We then introduce the
performance metrics used to evaluate the design and focus our
attention on some important design issues as well as on the design
of counters with variable execution time.
2.1 Basic Multiplication Scheme 
Figure 1 shows the architecture of an 8 × 8 multiplier [7] that
implements the product $A \times B$, with $A = (a_7, a_6, \ldots, a_0)$,
$B = (b_7, b_6, \ldots, b_0)$ and $a_i, b_i \in \{0, 1\}$, using the
Baugh-Wooley algorithm [1]. The array topology is triangular so that
the most and the least significant halves of the product can be
computed in parallel. The most significant digits $p_i$ of the result
are generated in redundant carry-save form and fed into the
on-the-fly conversion (OTFC) unit that operates in parallel with the
multiplier. The OTFC unit converts the result from redundant into
conventional representation [12]. The dashed line separates the
variable-delay part of the multiplier from the interface logic to the
OTFC unit and from the logic that generates the least significant
part of the result. Each element of the array is an $(m, n)$ parallel
counter [10], that is, an arithmetic circuit whose inputs are $m$ bits of
weight $2^w$ and whose $n$ outputs ($n = \lceil \log_2(m + 1) \rceil$) are the bits
with weights from $2^w$ to $2^{w+n-1}$ and represent the arithmetic sum
of the input bits. Counters are widely used in array multiplication
since they reduce the overall number of partial products
to be summed, thus reducing the execution time of the algorithm.
Referring to Figure 1, (3,2)b and (2,3,3) denote a binary
full-adder and a 2-bit full-adder respectively. Each redundant digit $p_i$
of the $b \times b$ multiplier is composed of $\log_2 r + 1$ bits, $r$ being
the radix. The $i$-th radix-4 redundant digit $p_i$ is formed by three
bits: we denote by $s_{i,1}$ and $c_{i,1}$ the two of them with weight
$2^{2b-2i+1}$ and by $c_{i,0}$ the one with weight $2^{2b-2i}$. The
assimilation of these bits produces a $p_i \in [0, 5]$, that must be
converted on-the-fly into a non-redundant radix-4 digit. The OTFC
algorithm is described in detail in [8] and is derived from the one
described in [7].
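The definition of an $(m, n)$ parallel counter and of the radix-4 redundant digit can be illustrated with a short behavioural sketch. It is not taken from the paper: the function names `counter_output_width`, `parallel_counter` and `redundant_digit` are hypothetical, and the model ignores timing and circuit structure entirely.

```python
def counter_output_width(m: int) -> int:
    """Number of outputs n = ceil(log2(m + 1)) of an (m, n) counter.

    For m >= 1 this equals m.bit_length(), which avoids floating point.
    """
    if m < 1:
        raise ValueError("a counter needs at least one input")
    return m.bit_length()


def parallel_counter(bits):
    """Behavioural (m, n) counter: output bits (MSB first) of the input sum."""
    n = counter_output_width(len(bits))
    total = sum(bits)
    return [(total >> k) & 1 for k in reversed(range(n))]


def redundant_digit(s1: int, c1: int, c0: int) -> int:
    """Value of a radix-4 redundant digit: two bits of weight 2, one of weight 1."""
    return 2 * (s1 + c1) + c0


if __name__ == "__main__":
    print(parallel_counter([1, 1, 1, 0, 1]))   # a (5,3) counter with four inputs set
    print(max(redundant_digit(s, c, k)
              for s in (0, 1) for c in (0, 1) for k in (0, 1)))
```

Enumerating the three bits of a redundant digit confirms that its value never exceeds 5, which is why the conversion unit must accept digits in $[0, 5]$.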
2.2 Area and Time Performance Criteria 
The designs are implemented using a set of full-custom CMOS 
CPL cells. Delays and areas are represented as multiples of 
the delay and area of a two-input XOR-XNOR gate. To make 
simulation realistic, parasitic resistances and capacitances were 
extracted and worst-case RC load due to wiring was estimated by 
assuming a draft layout plan [14]. Routing area was estimated as 
well by assuming a pitch of 5λ between two adjacent METAL 1
wires and taking into account that the wire width is 2λ.
2.3 Achieving Data-Dependent Delays 
We propose a multiplier that produces a result with a telescopic
delay [3]. Data dependency is achieved by means of a prediction
function. Let us consider, for the sake of simplicity, the case
of a (5,3) counter; the extension to other types of counters is
straightforward. A (5,3) counter has five inputs and three outputs;
let $y = (y_4, \ldots, y_0)$ be the input vector and $z = (z_2, z_1, z_0)$
the output vector. To shorten the execution time, we choose to
compute the sum by assimilating a reduced number of bits. In
the case of a (5,3) counter we use bits $(y_4, y_3, y_2)$ to predict the
result and bits $(y_1, y_0)$ for error detection and correction. We may
define the prediction function as the function
$F^P(y_4, y_3, y_2) = \sum_{i=2}^{4} y_i + \sigma$, where $\sigma$ is a prediction
of the value of the sum $y_1 + y_0$ of the discarded bits. The
speculated sum $\sigma$ is derived by taking into account the statistical
distribution of the values of the sum $y_1 + y_0$. Since
$0 \le y_1 + y_0 \le 2$, our goal is finding a small interval
$[\sigma_1, \sigma_2]$ so as to:
1. maximize $P(\sigma_1 \le y_1 + y_0 \le \sigma_2)$;
2. obtain a fast and simple implementation.
Unfortunately these requirements conflict, that is,
the higher the number of correct predictions, the more complex
the prediction function is. As a consequence a tradeoff between
number of correct predictions and hardware complexity must be
found. In [8] it has been demonstrated that, by assuming a uniform
distribution of the operands, the probability $P(y_i = 0)$ for the
counters of the first and second row of the 8 × 8 multiplier of
Figure 1 is the one reported in Table 1. According to these
values, we found that $P(y_1 + y_0 \le 1) > 0.9$ for the counters
of the first two rows [8]. Thus a very good choice could be
$\sigma_1 = 0$ and $\sigma_2 = 1$. This also means that $\sum_{i=0}^{4} y_i$ is more
likely to be $\sum_{i=2}^{4} y_i$ or $\sum_{i=2}^{4} y_i + 1$. As a consequence, to each
combination of $(y_4, y_3, y_2)$ we may associate two possible
values, namely $\sum_{i=2}^{4} y_i$ or $\sum_{i=2}^{4} y_i + 1$, as we will see in more
detail in Section 2.4. This higher degree of freedom allows
us to synthesize prediction functions with a very high number of
correct predictions and a short execution time, as will be shown
in the next section. The simulations performed have shown that
$P(y_1 + y_0 \le 1) > 0.9$ also for (5,3) counters of larger multipliers.
[Table 1: columns $P(y_1 = 0)$ and $P(y_0 = 0)$ for the first- and second-row counters; values not recoverable.]
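The order of magnitude of this probability can be checked with a small enumeration. The sketch below rests on a simplifying assumption that is not the analysis of [8]: both discarded inputs are taken to be plain partial-product bits $a_i b_j$ and $a_k b_l$ with distinct indices and uniformly distributed operand bits, which only approximates the counters of the first row and ignores the complemented terms of the Baugh-Wooley array. The function name `prob_sum_at_most_one` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob_sum_at_most_one() -> Fraction:
    """P(y1 + y0 <= 1) when y1 = ai AND bj and y0 = ak AND bl.

    Assumption of this sketch: the four operand bits are independent
    and uniformly distributed, so all 16 combinations are equally likely.
    """
    good = 0
    total = 0
    for ai, bj, ak, bl in product((0, 1), repeat=4):
        y1 = ai & bj
        y0 = ak & bl
        total += 1
        if y1 + y0 <= 1:
            good += 1
    return Fraction(good, total)


if __name__ == "__main__":
    p = prob_sum_at_most_one()
    print(p, float(p))
```

Under these assumptions only one of the sixteen combinations gives $y_1 + y_0 = 2$, so the probability is 15/16, consistent with the bound $P(y_1 + y_0 \le 1) > 0.9$ quoted in the text.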
2.4 Variable-Delay Counters 
Figure 2 depicts a possible realization of a (5,3) counter. This
realization is not the one used in this paper; it is explained to
ease the description of the actual approach. In this scheme, which
resembles a conditional-sum adder [16], all the possible sum values
are computed in parallel using bits $y_4$, $y_3$, $y_2$, whereas bits $y_1$ and
$y_0$ are used to select the correct value. The proposed scheme is
very similar to that of Figure 2, except that only the sum values more
likely to occur are computed. This choice is made according to
statistical considerations. Since not all of the possible sums are
computed, the result is in fact a speculation, and
hence a correction may be necessary if the prediction is wrong.
Since this class of counter exhibits a variable-delay behaviour we
call it variable-delay counter (VDC).
Figure 2. A Possible Implementation of a (5,3) Counter.
2.4.1 Boolean Relations for Fast Prediction 
Let us consider the case of a (5,3) VDC. Prediction is carried
out by function $F^P(y_4, y_3, y_2) = \sum_{i=2}^{4} y_i + \sigma$, where
$\sigma = \sigma(y_4, y_3, y_2)$ is a function that returns an integer value
such that $\sigma \in [0, 1]$. This means that for every combination
of $(y_4, y_3, y_2)$ we may choose between two predictable values:
we may associate to $F^P(y_4, y_3, y_2)$ either
$\sum_{i=2}^{4} y_i$ or $\sum_{i=2}^{4} y_i + 1$. This may be specified by using Boolean
relations [5]. Boolean relations allow a higher degree of freedom
during the synthesis since they associate to each combination
of the input variables of a Boolean function $f$ several valid
output values. Among all the possible output values, the one that
produces the $f$ with the simplest implementation is chosen. However,
a "classical" approach to prediction and correction, like in [9],
does not produce great improvements of the average execution
time, since only one value at a time is predicted and the rate of
correct predictions may not be sufficiently high. But if we anticipate
the computation of the value not predicted by $F^P$ by means of other
functions that work in parallel, two predictable values at a time
will be available. This will result in a higher rate of correct
predictions and hence in a smaller average delay of the counter. This task
is carried out by functions $F^P_{-1}(y_4, y_3, y_2) = \sum_{i=2}^{4} y_i + \sigma - 1$
and $F^P_{+1}(y_4, y_3, y_2) = \sum_{i=2}^{4} y_i + \sigma + 1$. In addition, function
$\sigma(y_4, y_3, y_2)$ is necessary to identify which value has been
predicted by function $F^P(y_4, y_3, y_2)$ in order to perform the selection
among the predicted values and the correction, in case of wrong
prediction. A correct choice of $\sigma_1$ and $\sigma_2$ is crucial since it
determines:
1. the number $(\sigma_2 - \sigma_1) + 1$ of possible output combinations of
$F^P$ that may be associated to each combination of the input
variables by means of Boolean relations;
2. the number $\lfloor \log_2(\sigma_2 - \sigma_1) \rfloor + 1$ of bits necessary to encode
$\sigma$;
3. the number $2(\sigma_2 - \sigma_1)$ of functions $F^P_j$.
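The three quantities in the list above follow directly from the width of the interval $[\sigma_1, \sigma_2]$. As a small worked illustration, the sketch below evaluates them; the function name `relation_parameters` and the dictionary keys are hypothetical labels, not terminology from the paper.

```python
def relation_parameters(sigma1: int, sigma2: int) -> dict:
    """Evaluate the three counts determined by the interval [sigma1, sigma2]."""
    width = sigma2 - sigma1
    if width < 1:
        raise ValueError("the sketch assumes sigma2 > sigma1")
    return {
        # (sigma2 - sigma1) + 1 admissible outputs per input combination
        "outputs_per_input": width + 1,
        # floor(log2(sigma2 - sigma1)) + 1, computed without floating point
        "sigma_bits": width.bit_length(),
        # 2 * (sigma2 - sigma1) auxiliary functions
        "auxiliary_functions": 2 * width,
    }


if __name__ == "__main__":
    print(relation_parameters(0, 1))
```

For the choice $\sigma_1 = 0$, $\sigma_2 = 1$ made in Section 2.3 this gives two admissible outputs per input combination, one bit for $\sigma$ and two auxiliary functions, namely $F^P_{-1}$ and $F^P_{+1}$.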
Finally, Table 2 shows the truth table of the prediction
function. The central column reports the output specification using
Boolean relations, whereas the rightmost one shows the selection
performed by the Boolean relation solver [17].
y4y3y2   Possible outputs
000      0000, 0011
001      0010, 0101
010      0010, 0101
011      0100, 0111
100      0010, 0101
101      0100, 0111
110      0100, 0111
111      0110, 1001
[Rightmost column (output selected by the Boolean relation solver) not recoverable.]
Table 2. (5,3) VDC Prediction Function Truth Table for
the Boolean Relation.
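The admissible outputs of the Boolean relation can be enumerated mechanically from the definition of $F^P$. The sketch below assumes that each 4-bit word of Table 2 is the 3-bit value of $F^P$ followed by the bit $\sigma$; this encoding is inferred from the legible entries of the table and is not stated explicitly in the text. The function name `relation_outputs` is hypothetical.

```python
from itertools import product


def relation_outputs(y4: int, y3: int, y2: int) -> list:
    """Admissible 4-bit words for one input combination.

    Assumed encoding (inferred, see lead-in): three bits of
    F^P = y4 + y3 + y2 + sigma, followed by sigma itself.
    """
    partial = y4 + y3 + y2
    return [format(((partial + sigma) << 1) | sigma, "04b") for sigma in (0, 1)]


if __name__ == "__main__":
    for y4, y3, y2 in product((0, 1), repeat=3):
        print(f"{y4}{y3}{y2}", ", ".join(relation_outputs(y4, y3, y2)))
```

A relation solver such as the one of [17] would then pick, for every row, whichever of the two words leads to the cheapest overall implementation; that selection step is not modelled here.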
2.4.2 Architecture 
Figure 3(a) depicts the basic scheme of a (5,3) VDC. The prediction
function $F^P = \sum_{i=2}^{4} y_i + \sigma$ operates in parallel with functions
$F^P_{-1} = \sum_{i=2}^{4} y_i + \sigma - 1$, $F^P_{+1} = \sum_{i=2}^{4} y_i + \sigma + 1$ and $\sigma$. Thus
function $F^P_{-1}$ handles the case in which $F^P = \sum_{i=2}^{4} y_i + 1$ and
$y_0 + y_1 = 0$, whereas function $F^P_{+1}$ handles the one in which
$F^P = \sum_{i=2}^{4} y_i$ and $y_0 + y_1 = 1$. Since $\sigma \in [0, 1]$, the only combination
of $y_1$ and $y_0$ that produces a prediction error is $(y_1, y_0) = (1, 1)$,
that is the one such that $y_1 + y_0 = 2 \notin [\sigma_1, \sigma_2]$; hence the
error detection function $e(y_1, y_0)$ is simply $e = y_1 \wedge y_0$ and
the correction function is $F^C(F^P, \sigma) = F^P + (2 - \sigma)$. Input
combinations $(y_1, y_0) \in \{(0, 0), (0, 1), (1, 0)\}$ identify a good
prediction. The case in which $y_1 + y_0 = 0$ is managed by function
$s_0(y_1, y_0) = (y_1 \vee y_0)'$, whereas the case $y_1 + y_0 = 1$ is managed
by function $s_1(y_1, y_0) = y_1 \oplus y_0$. For example, considering the
scheme of Figure 3(a) and the simplified truth table of Figure 3(b),
$F^P$ is selected if and only if $s_0 = \sigma' = 1$ or $s_1 = \sigma = 1$.
According to these considerations, we deduce that a (5,3) VDC
performs the following selection:
$$
z = \begin{cases}
F^P      & \text{if } (\sigma \wedge s_1) \vee (s_0 \wedge \sigma') \\
F^P_{+1} & \text{if } (s_1 \wedge \sigma') \\
F^P_{-1} & \text{if } (s_0 \wedge \sigma) \\
F^C      & \text{if } e
\end{cases}
\qquad (1)
$$
From equation (1) and Figure 3 we deduce that bits $y_1$, $y_0$ and
$\sigma$ make it possible to switch between two different delays: a short delay
in case of correct prediction, and a long delay in case of wrong
prediction. Table 3 shows the Boolean equations of a (5,3) VDC.
The equations of all the designed VDCs may be found in [8].
Table 3. (5,3) VDC Prediction and Correction Functions.
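The selection rule of equation (1) can be checked exhaustively with a behavioural model. The sketch below is only an arithmetic model under stated assumptions: the function `vdc_5_3`, the labels "short" and "long", and the three example choices of $\sigma$ are hypothetical, and the actual $\sigma$ chosen by the relation solver (Table 3, [8]) is not reproduced.

```python
from itertools import product


def vdc_5_3(y, sigma_fn):
    """Behavioural (5,3) VDC: returns (sum, delay class) for y = (y4, y3, y2, y1, y0)."""
    y4, y3, y2, y1, y0 = y
    sigma = sigma_fn(y4, y3, y2)        # speculated value of y1 + y0, 0 or 1
    f_p = y4 + y3 + y2 + sigma          # prediction F^P
    f_m1, f_p1 = f_p - 1, f_p + 1       # F^P_{-1} and F^P_{+1}, computed in parallel
    e = y1 & y0                         # error detection: y1 + y0 = 2
    s0 = 1 - (y1 | y0)                  # y1 + y0 = 0
    s1 = y1 ^ y0                        # y1 + y0 = 1
    if e:
        return f_p + (2 - sigma), "long"    # correction F^C
    if (sigma and s1) or (s0 and not sigma):
        return f_p, "short"
    if s1 and not sigma:
        return f_p1, "short"
    return f_m1, "short"                    # remaining case: s0 and sigma


# Example sigma functions; any Boolean function of (y4, y3, y2) is admissible.
SIGMA_CHOICES = {
    "always 0": lambda y4, y3, y2: 0,
    "always 1": lambda y4, y3, y2: 1,
    "majority": lambda y4, y3, y2: int(y4 + y3 + y2 >= 2),
}

if __name__ == "__main__":
    for name, fn in SIGMA_CHOICES.items():
        results = [(vdc_5_3(y, fn), sum(y)) for y in product((0, 1), repeat=5)]
        ok = all(out[0] == expected for out, expected in results)
        slow = sum(out[1] == "long" for out, _ in results)
        print(name, "correct:", ok, "long-delay inputs:", slow)
```

Whatever $\sigma$ is chosen, the model returns the exact sum for all 32 input combinations, and only the 8 combinations with $y_1 = y_0 = 1$ take the long path. The choice of $\sigma$ therefore affects implementation cost, not correctness, which is the freedom the Boolean relation exploits.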
3 Implementation 
We have designed a radix-4 array for multiplication. All the
basic processing elements are implemented using elementary CPL
gates [2] since they are faster and consume less power than
standard CMOS logic. The gates have been designed using a
0.8 µm technology and modified to permit the precharge
of the output nodes. The whole design has been simulated using
HSPICE. Prediction functions have been synthesized using
Boolean relations [17], in order to produce a minimum-cost
function. In [8] we also prove that the designed cells have a monotonic
behaviour. This is crucial to assure the correctness of the dual-rail
protocol [6].
To give technology-independent estimations, delays and areas
are normalized with respect to the worst-case delay and area of an
unloaded XOR-XNOR gate. Area estimations take into account
the area due to interconnections among the cells. It must also be
pointed out that the proposed scheme, as well as the one described
in [7], produces the most significant half of the result in redundant
carry-save form. Hence the result is converted on-the-fly into
non-redundant form as described in [8].
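The conversion algorithm itself is given in [8] and is not reproduced in this paper. Purely as an illustration of why no final carry-propagate adder is needed, the sketch below applies the general on-the-fly idea of [12] to radix-4 digits in $[0, 5]$ received most significant digit first: two candidate results, $Q$ and $Q + 1$, are kept and each step only appends one digit to one of them. This is an assumption-laden stand-in, not the OTFC unit of [8]; the function names `on_the_fly_convert` and `digits_to_int` are hypothetical.

```python
import random


def on_the_fly_convert(digits):
    """Convert radix-4 digits in [0, 5] (MSD first) to digits in [0, 3].

    q holds the converted prefix, qp holds the same prefix plus one.
    A digit >= 4 means a carry into the prefix, so qp is extended instead
    of q; no carry ever ripples through already-produced digits.
    """
    q, qp = [0], [1]
    for p in digits:
        if not 0 <= p <= 5:
            raise ValueError("redundant digit out of range")
        new_q = (q if p <= 3 else qp) + [p % 4]
        new_qp = (q if p <= 2 else qp) + [(p + 1) % 4]
        q, qp = new_q, new_qp
    return q


def digits_to_int(digits, radix=4):
    """Value of a digit string, most significant digit first."""
    value = 0
    for d in digits:
        value = radix * value + d
    return value


if __name__ == "__main__":
    random.seed(1)
    word = [random.randint(0, 5) for _ in range(8)]
    converted = on_the_fly_convert(word)
    print(word, converted, digits_to_int(word) == digits_to_int(converted))
```

The leading digit of the result absorbs a possible carry out of the most significant redundant digit; in a real multiplier the product width bounds this digit, a detail the sketch does not model.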
3.1 Synchronization and Timing 
Figure 4 shows a sketch of the implementation. Figure 5 reports
the four-phase handshake protocol of the proposed multiplication
scheme as well as its reset time ($2t_{XOR}$). As long as the input
operands are not valid, the circuit is held in the precharge state,
forcing the outputs of all the VDCs to logical "0". When the
operands are valid the P/E' signal goes down and the circuit
enters the evaluation phase. The completion detection logic (whose
operation is described in Section 3.2) forces RDY high as soon
as a valid result has been produced. RDY goes down when the
input operands are no longer valid, that is when the circuit is
forced into the precharge state by P/E' going high. To generate
a correct RDY signal efficiently and using a reduced amount of
hardware resources, the carry-save array and the OTFC must be
synchronized. To achieve this we exploit the fact that the
carry-save array generates the result digits sequentially from the most
to the least significant one. As a consequence we may delay the
last stage of the OTFC unit, keeping it precharged, until the least
significant digit is valid. All the implementation details may be
found in [8].
Figure 3. A (5,3) VDC: (a) Basic Scheme, and (b) Prediction Function Simplified Truth Table.
Figure 4. Sketch of Implementation.
Figure 5. Timing (signals IN, P/E', OUT, RDY; reset time $2t_{XOR}$).
3.2 Completion Detection 
To detect the end of a multiplication we have to detect the end
of the conversion into non-redundant form. This is done using a
completion detection scheme based on the one proposed in [15]
and depicted in Figure 6. In the precharge phase the RDY output
is set to "0": since the differential outputs $m_i$ and $m_i'$ of the
OTFC are NORed, all the n-MOS pull-downs conduct, keeping the
output low. In the evaluation phase, as soon as the last differential
output is valid, all the pull-downs are disabled and the p-MOS
pull-up drives the RDY output to logical "1". The extra latency
introduced by the completion detection logic is only $t_{XOR}$, $t_{XOR}$
being the delay of an XOR gate.
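At the logic level, the behaviour described above amounts to RDY being high only when every dual-rail pair carries valid data. The following sketch models that Boolean behaviour only; the function name `completion_detect` is hypothetical, and transistor-level effects, timing and the precharge control itself are not modelled.

```python
def completion_detect(pairs):
    """RDY for a list of dual-rail pairs (m_i, m_i').

    Each pair drives a pull-down through a NOR: the pull-down conducts
    while both rails are still low (precharged). RDY rises only when
    no pull-down conducts, i.e. when every pair has one rail high.
    """
    pull_downs = [int(not (m or m_bar)) for m, m_bar in pairs]
    return int(not any(pull_downs))


if __name__ == "__main__":
    print(completion_detect([(0, 0), (0, 0), (0, 0)]))   # precharge: all rails low
    print(completion_detect([(1, 0), (0, 0), (0, 1)]))   # one digit still pending
    print(completion_detect([(1, 0), (0, 1), (0, 1)]))   # all outputs valid
```

A single pair still in the precharged state is enough to hold RDY low, which is why the slowest output determines when completion is signalled.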
3.3 Extension to Higher Radices 
Throughout this paper we have dealt exclusively with (5,3) VDCs as
basic processing elements of multiplier arrays of several sizes.
However, the proposed design approach may be extended to higher
radices. [8] also reports the Boolean equations for the (6,3) and (7,3)
VDCs used to implement radix-8 and radix-16 multipliers. We
must remark that, in the proposed multiplication scheme, each
row of the carry-save array that computes the least significant half
of the result must generate $\log_2 r$ bits in carry-assimilated form.
This task is performed by a fast carry-propagation adder. For high
values of the radix $r$, the latency introduced by the adder may be
critical, restricting our design methodology to radices not bigger
than 16.
Figure 6. Completion Detection Logic.
4 Comparisons 
We will compare the designs both in terms of delay and in terms
of area, even though the implementation described in [13] does not
require any supplementary hardware for the on-the-fly conversion,
unlike [7] and the multiplication scheme presented in
this paper. In fact, the use of an OTFC unit requires extra area
with respect to a design like [13] that uses an adder at the final stage
of the array. In order to give coherent estimations, all the designs
have been implemented using the same technology and CPL gates.
For the same reason we will limit our analysis to the radix-4 case,
that is the case of the multiplication scheme of [7]. On the other
hand, although the implementation proposed in [13] is a radix-2
scheme, we include it in the comparison all the same, since with
our approach it is impossible to synthesize data-dependent counters
smaller than (4,3), whereas the approach described in [13] is only
suitable for radix-2 arrays. Delays and areas will be expressed as
multiples of the delay and area of a two-input XOR-XNOR gate.
The average delays of the asynchronous implementations have
been computed by simulating $10^4$ multiplications with randomly
generated operands and assuming a uniform distribution.
4.1 Delay Estimation 
The proposed multiplication scheme has a delay $\Delta_{VDCM}$ (where
VDCM stands for variable-delay counter multiplier) that is the
sum of four contributions: $\Delta_{VDCM} = t_{array} + t_{otfc} + n\,t_{buf} + t_{sync}$,
where $t_{array}$ is the latency of the array, $t_{otfc}$ the latency
due to the conversion logic, $n \in [0, 3]$ is the number of buffers
(with delay $t_{buf}$) necessary to drive the stage of the OTFC unit [...]
Delay
b     n     VDCM     [header illegible]   [header illegible]
8     [row illegible]
16    [row illegible]
24    2     59.79    85.81                88.59
32    3     76.69    111.86               121.13
[Plot residue: curve labelled "vs. KB" over b = 8, 16, 24, 32.]
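The two legible rows of the delay table allow a rough reading of the relative gain. The sketch below simply divides the two other delay columns by the VDCM column. Which reference design each of those columns corresponds to is not legible in the source, so they are kept anonymous as `other_1` and `other_2`; these names, and the function `delay_ratios`, are hypothetical.

```python
# Legible rows only: b -> (VDCM, other_1, other_2), in XOR-XNOR gate delays.
LEGIBLE_DELAYS = {
    24: (59.79, 85.81, 88.59),
    32: (76.69, 111.86, 121.13),
}


def delay_ratios(rows):
    """Ratio of each competing delay to the VDCM delay, per operand width b."""
    return {b: (o1 / vdcm, o2 / vdcm) for b, (vdcm, o1, o2) in rows.items()}


if __name__ == "__main__":
    for b, (r1, r2) in delay_ratios(LEGIBLE_DELAYS).items():
        print(f"b = {b}: other_1/VDCM = {r1:.2f}, other_2/VDCM = {r2:.2f}")
```

For both operand widths the competing delays are larger than the VDCM delay by a factor between roughly 1.4 and 1.6; no conclusion is drawn here about the rows that could not be recovered.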







4.2 Area Estimation 
In this section we evaluate the area occupation of both the OTFC
unit and the multiplying array. Table 5 shows the overall areas
of the implemented multiplication schemes, obtained by considering
the areas of the array for partial-product generation, the array
of basic processing elements, and the conversion unit or the final
adder in the case of [13]. Areas are expressed as multiples of the area
of an XOR-XNOR gate.
Table 5. Normalized Overall Areas of the Implemented
Multipliers.
5 Conclusions
Variable-delay counters (VDCs) are the basis for the design of
the asynchronous multiplier presented in this paper. The trade-off
between prediction accuracy and calculation speed has been
explored by using Boolean relations for the specification and
minimization of the behaviour of VDCs. This has resulted in a novel
design that for the first time, to the knowledge of the authors,
exceeds the efficiency of synchronous multipliers.
Counters are widely used in arithmetic circuits. We think we
have only presented one possible application and that the use of 
variable-delay building blocks and the exploration of implemen- 
tations by means of Boolean relations can be combined in many 
other areas that may benefit from the average-case performance 
offered by asynchronous circuits. 
References 
[1] C. R. Baugh and R. A. Wooley. A Two's Complement Parallel
Array Multiplication Algorithm. IEEE Transactions on Computers,
C-22(12):1045-1047, December 1973.
[2] A. Bellaouar and M. I. Elmasry. Low-Power Digital VLSI Design.
Kluwer Academic Publisher, Norwell, MA, 1995.
[3] L. Benini, G. De Micheli, A. Lioy, et al. Automatic Synthesis of
Large Telescopic Units Based on Near-Minimum Timed Supersetting.
IEEE Transactions on Computers, C-48(8):769-779, August 1999.
[4] A. D. Booth. A Signed Binary Multiplication Technique. Quarterly
Journal Mech. Appl. Math., 4, Part 2:236-240, 1951.
[5] R. K. Brayton and F. Somenzi. Boolean Relations and the Incomplete
Specification of Logic Networks. In Int. Conf. Very Large Scale
Integration, 1989.
[6] J. A. Brzozowski and C.-J. H. Seger. Asynchronous Circuits.
Springer-Verlag, New York, 1994.
[7] L. Ciminiera and P. Montuschi. Carry-Save Multiplication Schemes
Without Final Addition. IEEE Transactions on Computers,
C-45(9):1050-1055, September 1996.
[8] G. Cornetta. Design and Analysis of Variable-Delay Arithmetic
Units. PhD thesis, Polytechnic University of Catalonia, Dept. of
Computer Architecture, Barcelona, September 2001.
[9] J. Cortadella and T. Lang. High-Radix Division and Square Root
with Speculation. IEEE Transactions on Computers,
C-43(8):919-931, August 1994.
[10] L. Dadda. Some Schemes for Parallel Multipliers. Alta Frequenza,
34:349-356, March 1965.
[11] A. De Gloria and M. Olivieri. Statistical Carry Lookahead Adders.
IEEE Transactions on Computers, C-45(3):340-347, March 1996.
[12] M. D. Ercegovac and T. Lang. On the Fly Conversion of Redundant
into Conventional Representations. IEEE Transactions on Computers,
C-36(7):895-897, July 1987.
[13] D. Kearney and N. W. Bergmann. Bundled Data Asynchronous
Multipliers with Data Dependent Computation Times. In 3rd IEEE
Symposium on Advanced Research in Asynchronous Circuits and
Systems, pages 186-197, 1997.
[14] G. Matsubara and N. Ide. A Low Power Zero-Overhead Self-Timed
Division and Square Root Unit Combining a Single-Rail Static
Circuit with a Dual-Rail Dynamic Circuit. In Symposium on Advanced
Research in Asynchronous Circuits and Systems, pages 198-209,
1997.
[15] S. M. Nowick, K. Y. Yun, P. A. Beerel, and A. E. Dooply.
Speculative Completion for the Design of High-Performance Asynchronous
Dynamic Adders. In Symposium on Advanced Research in
Asynchronous Circuits and Systems, pages 210-223, 1997.
[16] J. Sklansky. Conditional Sum Addition Logic. IRE Transactions on
Electronic Computers, EC-9:226-231, June 1960.
[17] Y. Watanabe and R. K. Brayton. Heuristic Minimization of
Multiple-Valued Relations. IEEE Transactions on Computer-Aided Design of
Integrated Circuits, 12(10):1458-1472, October 1993.
[18] T. E. Williams and M. Horowitz. A 160ns 54-bit CMOS Division
Implementation Using Self-Timing and Symmetrically Overlapped
SRT Stages. In 10th IEEE Symposium on Computer Arithmetic,
pages 210-217, 1991.
