A multi-radix approach to asynchronous division by Cornetta, Gianluca & Cortadella, Jordi
A Multi-Radix Approach to Asynchronous Division * 
Gianluca Cornetta, Computer Architecture Dept., Universitat Politècnica de Catalunya, 08034 Barcelona, Spain. E-mail: cornetta@ac.upc.es
Jordi Cortadella, Software Dept., Universitat Politècnica de Catalunya, 08034 Barcelona, Spain. E-mail: jordic@lsi.upc.es
Abstract 
The speed of high-radix digit-recurrence dividers is 
mainly determined by the hardware complexity of the 
quotient-digit selection function. In this paper we present 
a scheme that combines the area efficiency of bundled data 
with data-dependent computation time. In this scheme the 
selection function is very simple and may be implemented 
using a fast adder. This function speculates the result digit
and, when the speculation is incorrect, a correction of the 
quotient and of the residual must be performed. When 
the residual satisfies some constraints it is also possible 
to switch to a higher radix, computing a fraction of the next 
digit in advance. This results in a division scheme with a 
variable iteration time and a variable number of iterations 
and hence with an asynchronous behaviour. Several designs
were realized and compared both in terms of execution time 
and area. The fastest unit considered is a radix-64 divider
that may switch to radix 128 or 256. Our evaluations show
that area × delay savings from 25% to 65%, compared to
equivalent synchronous designs, may be achieved. 
1 Introduction 
High-speed arithmetic operations are becoming impor- 
tant also in general purpose processors. While sub-micron 
technologies are an important step toward addressing the 
latency problem, they are not a complete solution. As chips 
become larger, problems such as clock distribution and skew
are a major concern. A possible design tendency to tackle 
these problems in future processors is asynchronization [3]. 
In this paper we explore a novel design technique suitable for 
the design of asynchronous arithmetic circuits for division. 
* This work has been partially funded by the Ministry of Education of Spain under CICYT, TIC 98-0410, ACiD-WG (ESPRIT 21949), a grant by Intel Corporation and CIRIT 1999SGR-150.
Among all the known algorithms for division, the ones
involving recurrence [10] exhibit a good tradeoff between
performance and area occupation. Such class of algorithms
computes a quotient digit per iteration, thus resulting in lin- 
ear convergence. Consequently, to reduce the number of 
iterations, it is convenient to use a higher radix for the result 
digit. Nevertheless, the higher the radix, the higher the com- 
plexity of the result-digit selection function. The complexity 
of the selection function increases the iteration delay as well 
and eliminates the advantage. Several techniques to improve 
the execution time of the algorithm have been proposed, in- 
cluding prediction, operands prescaling and overlapping of 
several selection stages [1, 18, 19, 7, 8, 9, 10].
Previous works related to asynchronous division [19, 13, 12] deal with radix-2 division. This paper, to the authors' knowledge, is the first dealing with a design technique suitable for implementing very-high radix asynchronous dividers. There are several differences between [19, 13, 12] and the algorithm we describe in this paper. The architectures reported in [19, 13, 12] are self-timed loops formed by several pipelined stages. Each stage represents a radix-2 iteration step of the standard recurrence division [10] and operates as soon as the required operands arrive. Synchronization between adjacent stages is achieved by using DCVSL differential logic and a dual-rail handshake protocol [2]. Each stage computes all the possible assimilations of the partial residual in advance, and since this number depends on the maximum quotient digit, this approach is clearly limited to radix 2.
As mentioned previously, the use of a low radix is highly 
penalizing because it increases the number of iterations nec- 
essary to complete a division. The proposed algorithm over- 
comes this intrinsic limitation of the approaches described 
in [19, 13, 12], allowing the implementation of very-high
radix asynchronous dividers. 
The proposed approach is general and suitable for any 
kind of hardware implementation, including self-timed 
loops, even if in this paper we present a standard implementation
with cascaded CSA stages [10] suitably modified
to interact with a bundled-data handshake circuit. 
1522-8681/01 $10.00 © 2001 IEEE
Unlike [19, 13, 12], where each stage is designed to have two different delay paths and to operate as soon as the inputs are valid, the proposed algorithm achieves variable execution times by means of a selection function with data-dependent computation times [5]. This selection function
speculates the quotient digit. In case of correct prediction, 
the completion-detection logic is switched on the delay chain 
that matches the best-case datapath delay. If the speculation 
is wrong the error must be corrected. This requires an addi- 
tional latency and the completion-detection logic is switched 
on the delay chain matching the worst-case datapath delay. 
The correctness of a speculation is checked by an error 
detection and correction function. However, unlike [5], speculation and error detection are parallel processes; hence a wrong speculation does not need any additional iteration and the execution time is shortened even in case of wrong speculation. Parallelism between these processes is
achieved by speculating the result digit using the integer part 
of the shifted partial remainder, whereas the fractional part 
is used to determine the correctness of the speculation. This 
also leads to very simple, and hence faster, implementations 
since the selection function may be realized using an adder. 
In a variation of this scheme, we allow switching to a higher radix when the partial residual satisfies certain constraints. This increases the number of bits of the result computed in an iteration. This technique, which is reminiscent of the approaches proposed in [5, 14], results in an implementation with a variable number of iterations. This is a further advantage with respect to the implementations described in [19, 13, 12], since those designs perform a division in a fixed number of iterations with a variable iteration delay, whereas the proposed realization performs a division with a variable number of iterations and a variable iteration delay, thus decreasing the overall execution time of the algorithm.
In summary, the variable-delay behaviour is achieved through data-dependent computation times by means of a very simple speculation and error detection and correction function and by a function that switches to a higher radix when the divider input matches certain constraints. Synchronization is achieved by means of bundled-data with a four-phase handshake protocol [2]. This allows the implementation of division units with a reduced area, since we avoid the use of differential logic with dual-rail synchronization.
We develop the method, evaluate alternative possibilities 
and present some examples of implementation, comparing 
them with the implementation of conventional and specula- 
tive algorithms using the same technological constraints. 
This paper is organized as follows. Section 2 summa- 
rizes the division algorithm and notation. In Section 3, 
we introduce the basic idea behind quotient digit specula- 
tion as well as the idea to switch to a higher radix when 
certain constraints are matched. Quotient digit speculation 
and operand scaling, as a technique to improve the rate of 
correct predictions, are discussed thoroughly in Section 4. 
Section 5 gives an example of how the proposed algorithm 
works. In Section 6, we summarize the evaluation criteria 
of the implemented units. Section 7 reports the characteristics
of a radix-16 unit implementing the proposed algorithm.
Finally, in Section 8, we draw up some conclusions. All the 
implementation details and formal proofs are reported in [4]. 
2 Division Algorithm and Notation 
We now briefly review the well-known division algorithm and outline the notation used throughout this paper. The standard recurrence used for division is [10]:
$$w[j+1] = r \cdot w[j] - q_{j+1} \cdot d, \qquad w[0] = x,$$
where $w[j]$ is the full-precision partial residual after the $j$-th iteration, $r$ is the radix, $q_{j+1}$ is the quotient digit, $d$ the full-precision divisor and $x$ the full-precision dividend. For the sake of simplicity, we assume that $x$ and $d$ are positive normalized fractions and that $x < d$, so that the division always starts with the shift of the partial residual and the quotient $Q$ is a positive normalized fraction as well.
To have a fast iteration, a carry-save adder is used, with 
the partial residual in carry-save redundant form, although 
a similar development could be done for other redundant 
representations of w[j]. Redundant quotient digits are con- 
verted on-the-fly [6] into a conventional representation. 
The quotient digit is a signed digit $|q_{j+1}| \le a$ with redundancy factor $\rho = \frac{a}{r-1}$ (with $\rho \in (\frac{1}{2}, 1]$). This requires $w[0] \le \rho d$, which is obtained by shifting the dividend. Moreover, to assure the convergence of the algorithm, the partial residual must be bounded [10], namely:
$$|w[j]| \le \rho d.$$
The quotient digit $q_{j+1}$ is determined by a quotient-digit selection function $F$. This function depends on an estimate of the residual and on an estimate of the divisor; that is:
$$q_{j+1} = F(\hat{w}[j], \hat{d}).$$
Figure 1. Iteration Step of Digit-Recurrence Division.
Figure 2. Basic Scheme: (a) Block Diagram, (b) Timing Diagram.
The basic scheme that implements the iteration step is shown 
in Figure 1. There are several ways of implementing the 
selection function F. Most of them are described in detail
in [10].
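The standard recurrence of this section can be exercised with a few lines of Python. This is only an illustrative sketch: it uses exact rational arithmetic instead of carry-save hardware, and the selection rule (round $r\,w[j]/d$ to the nearest integer, clipped to $[-a, a]$) is an idealized full-precision stand-in assumed here, not the selection function $F$ of any design discussed in this paper. The function name `digit_recurrence_divide` is hypothetical.

```python
from fractions import Fraction


def digit_recurrence_divide(x, d, r, a, n):
    """Run n iterations of w[j+1] = r*w[j] - q[j+1]*d with w[0] = x.

    Assumes 0 < x < d and x <= rho*d, where rho = a/(r-1).
    Returns (quotient, final residual, list of signed digits).
    """
    w = Fraction(x)
    d = Fraction(d)
    digits = []
    for _ in range(n):
        shifted = r * w
        # Idealized selection (assumption): nearest integer, clipped to the digit set.
        q = max(-a, min(a, round(shifted / d)))
        w = shifted - q * d
        digits.append(q)
    # Digit j+1 has weight r**-(j+1).
    quotient = sum(Fraction(q, r ** (j + 1)) for j, q in enumerate(digits))
    return quotient, w, digits


quotient, residual, digits = digit_recurrence_divide(
    Fraction(3, 8), Fraction(3, 4), r=4, a=2, n=3
)
print(digits, quotient, residual)  # prints: [2, 0, 0] 1/2 0
```

After $n$ iterations the identity $x = Q \cdot d + w[n] \cdot r^{-n}$ holds exactly, which is a convenient check for any alternative selection rule plugged into the loop.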
3 Division Scheme with Quotient Digit Spec- 
ulation 
We now describe the basic scheme, without higher radix 
switch, and then introduce the scheme with higher radix 
switch (HRS). 
3.1 Basic Scheme 
The basic scheme is depicted in Figure 2(a). As al- 
ready mentioned in the introduction, the quotient digit is 
speculated. The theory developed to design the speculation 
function will be described in detail in the next section. The 
speculated digit is used to compute an interim residual 
$\tilde{w}[j+1] = r \cdot w[j] - q^{s}_{j+1} \cdot d$. The next residual is then calculated as follows:
$$w[j+1] = \begin{cases} \tilde{w}[j+1] & \text{if } \tilde{w}[j+1] \text{ is bounded} \\ \tilde{w}[j+1] - q^{c}_{j+1} \cdot d & \text{otherwise} \end{cases} \qquad (3)$$
where $q^{s}_{j+1}$ is the speculated digit and $q^{c}_{j+1}$ is the correction digit, which is such that $q_{j+1} = q^{s}_{j+1} + q^{c}_{j+1}$ with $q^{c}_{j+1} \in \{-1, +1\}$, whereas in case of correct speculation $q_{j+1} = q^{s}_{j+1}$. As we will see later in this paper, the bound checking operation, necessary to check whether a speculation is correct or not, is performed by examining $w[j]$ and not $\tilde{w}[j+1]$. This leads to fast execution times. As shown in Figure 2(b), the proposed scheme for division exhibits a variable delay behaviour. Handshake signals are generated using switched delays [15, 16, 11] and a four-phase protocol [2] using request (req) and acknowledge (ack) as handshake signals. The increased iteration length in case of wrong prediction depends on the hardware complexity of the error-detection function. However, since speculation and error detection are overlapped, the time overhead due to a wrong prediction does not affect the performance of the algorithm significantly.
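The effect of the switched delays on the average iteration time can be illustrated numerically. The sketch below assumes two matched delay chains (short and long iteration) and a fixed probability of correct speculation; the delay values and probabilities are hypothetical placeholders, not measurements from this paper.

```python
def average_iteration_delay(t_fast, t_slow, p_correct):
    """Expected iteration delay when the completion signal is switched
    between a best-case (t_fast) and a worst-case (t_slow) delay chain."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability")
    return p_correct * t_fast + (1.0 - p_correct) * t_slow


# Hypothetical numbers, in units of a 2-input NAND delay.
t_fast, t_slow = 20.0, 24.0
for p in (0.5, 0.8, 0.95):
    print(p, average_iteration_delay(t_fast, t_slow, p))
```

A synchronous design would have to clock every iteration at `t_slow`; the bundled-data scheme pays that cost only for the fraction of iterations that need a correction.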
3.2 Scheme with Higher Radix Switch (HRS) 
In the basic scheme, two situations may occur: either we 
have a short-latency iteration if the speculation is correct, or 
a long-latency iteration if the prediction is wrong and a cor- 
rection must be carried out. The delay of the implementation 
is reduced by the scheme with HRS. In this scheme, when 
the residual is small, a third situation is allowed, namely the 
switching to a higher radix $R$, so that a fraction of the next digit is computed in advance, thus reducing the overall number of iterations necessary to complete the division. Radix $R$ is such that $R = p\,r$, where $r$ is the original radix and $p$ is a power of 2. The number of bits computed in advance must be selected so that the next residual is bounded. That is, it is possible to calculate $\log_2 p$ extra bits if the next residual, once shifted by the additional factor $p$, still satisfies the convergence bound. Taking into account that $R = p\,r$, we obtain that:
$$|w[j+1]| \le \frac{\rho}{p}\, d. \qquad (4)$$
Finally, Figure 3 shows the basic scheme of the implementation with HRS. The multiplexer used to select the amount of shifting at the end of an iteration increases the iteration latency; nevertheless, the reduction of the overall number of iterations due to the HRS increases the performance of the algorithm.
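Condition (4) can be read as a simple test on the magnitude of the next residual. The following sketch, with hypothetical names and exact rational arithmetic, returns the largest number of extra bits $\log_2 p$ for which $|w[j+1]| \le \rho d / p$ holds, up to a chosen maximum. A hardware implementation would instead compare a truncated residual against precomputed thresholds.

```python
from fractions import Fraction


def extra_bits(w_next, d, rho, max_bits):
    """Largest k <= max_bits with |w_next| <= rho*d / 2**k (condition (4), p = 2**k)."""
    k = 0
    while k < max_bits and abs(w_next) <= rho * d / 2 ** (k + 1):
        k += 1
    return k


rho, d = Fraction(1), Fraction(1)
print(extra_bits(Fraction(1, 5), d, rho, max_bits=2))   # 2, since |w| <= 1/4
print(extra_bits(Fraction(3, 10), d, rho, max_bits=2))  # 1, since 1/4 < |w| <= 1/2
print(extra_bits(Fraction(3, 5), d, rho, max_bits=2))   # 0, since |w| > 1/2
```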
4 Quotient Digit Speculation and Operands 
Scaling 
As indicated in the previous section, the idea is to re- 
duce the latency of the quotient-digit selection by using a 
simpler function that gives a correct value with high proba- 
bility. The complexity of this function is related to its delay, 
namely the simpler the selection function, the shorter the 
latency. However, since the probability of a correct predic- 
tion decreases as the hardware complexity of the speculation 
function decreases, the whole design space must be carefully examined and a tradeoff must be found between accuracy and simplicity.
Figure 3. Scheme with Higher Radix Switch.
We develop first the theory for the case of full-precision 
residuals and divisors and then extend it to the case with 
residual estimates. In particular, we propose to speculate 
the quotient digit considering only the integer part of the 
shifted residual (represented in carry-save form), namely: 
$$q^{s}_{j+1} = \mathrm{Int}(r\,w[j]); \qquad (5)$$
consequently a fast adder may be used to implement the selection function. Since $q^{s}_{j+1}$ is only a speculation it may be incorrect.
Equation (5) does not depend on $d$. We can achieve this by operand scaling [8]. Namely, we want to find a scaling factor $M$ such that $X = Mx$ ($X = w[0]$) and $D = Md$. The scaling factor $M$ must be such that the divisor $d$ is scaled into an interval $[D_{min}, D_{max}]$, with $D_{min} = 1 - 2^{-\delta}$ and $D_{max} > D_{min}$, where $\delta$ is the number of bits of $d$ that must be evaluated in order to compute the scaling factor $M$. The value of $\delta$ must be suitably calculated so that:
$$D_{min} = 1 - 2^{-\delta} \approx 1. \qquad (6)$$
As will be explained later in this paper, this is necessary to improve the rate of correct predictions of the selection function. The upper bound $D_{max}$ of the scaled divisor range is such that:
$$D_{max} = 1 + \alpha \qquad (7)$$
where $0 < \alpha \ll 1$ depends on $\rho$, $\delta$ and $a$. In [4] we proved that in order to guarantee the convergence of the algorithm $\delta$ must satisfy the following equation:
$$\rho \ge (a + \rho)\,2^{-\delta} \qquad (8)$$
whereas $\alpha$ must be such that:
$$\alpha < \frac{1 - \rho\,(1 - 2^{-\delta})}{a}. \qquad (9)$$
In conclusion, to simplify the selection function and to increase the number of correct predictions, the operands must be preprocessed by a scaling function before executing the recurrence (3). The scaling requires two extra iterations. However, this technique is still advantageous because the iteration latency is reduced.
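As a quick numerical reading of constraint (8), the sketch below searches for the smallest $\delta$ with $\rho \ge (a + \rho)\,2^{-\delta}$ for a given radix and maximum digit. It covers (8) only; the final choice of $\delta$ also depends on the other constraints derived in the paper, so the value returned is a lower bound. The function name is hypothetical.

```python
from fractions import Fraction


def min_delta_for_constraint_8(r, a, max_delta=64):
    """Smallest delta with rho >= (a + rho) * 2**-delta, where rho = a/(r-1)."""
    rho = Fraction(a, r - 1)
    for delta in range(1, max_delta + 1):
        if rho >= (a + rho) / 2 ** delta:
            return delta
    raise ValueError("no delta found up to max_delta")


for r, a in ((8, 7), (16, 9)):
    delta = min_delta_for_constraint_8(r, a)
    d_min = 1 - Fraction(1, 2 ** delta)
    print(f"r={r}, a={a}: delta >= {delta}, D_min at that delta = {float(d_min)}")
```

For $r = 8$ and $a = 7$ this gives $\delta \ge 3$; the example of Section 5 uses $\delta = 5$, which satisfies this lower bound.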
4.1 Selection Function Design 
The selection function speculates and, if necessary, corrects the quotient digit. The error detection and correction is carried out by examining the fractional part $\mathrm{frac}(r\,w[j])$ of the assimilated residual. Considering equation (5), and supposing that $q_{j+1} = q^{s}_{j+1}$ (hence by equation (3) $w[j+1] = \tilde{w}[j+1]$), the recurrence becomes:
$$w[j+1] = r \cdot w[j] - q^{s}_{j+1} \cdot d = (1 - d)\,\mathrm{Int}(r\,w[j]) + \mathrm{frac}(r\,w[j]).$$
After operand scaling, the recurrence becomes:
$$w[j+1] = (1 - D)\,\mathrm{Int}(r\,w[j]) + \mathrm{frac}(r\,w[j]). \qquad (10)$$
4.1.1 Correct Prediction 
Selection of quotient digits is essentially a bound-checking operation where the residual is compared with some bounds that depend on the scaled divisor $D$. To simplify the selection, making it independent of $D$, we may restrict the selection bounds by imposing that $|w[j]| \le \rho D_{min}$. The value of $D_{min}$, and hence of $\delta$, must be suitably chosen in order to guarantee the convergence of the algorithm. Consequently we may impose that (10) must be bounded. To obtain constant bounds we consider the worst case. In [4] we show that this happens for $D = D_{min}$ and $\mathrm{Int}(r\,w[j]) = a$. Hence we obtain:
$$-(\rho + a)\,D_{min} + a \le \mathrm{frac}(r\,w[j]) \le (\rho + a)\,D_{min} - a. \qquad (11)$$
If equation (11) holds then the selected digit is $q_{j+1} = q^{s}_{j+1} = \mathrm{Int}(r\,w[j])$.
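The step from recurrence (10) to a constant bound on the fractional part can be spelled out as follows. This is a sketch of the worst-case argument only, assuming $D_{min} \le D \le 1$ and a non-negative integer part; the complete proof is the one referenced in [4].

```latex
\begin{align*}
w[j+1] &= (1 - D)\,\mathrm{Int}(r\,w[j]) + \mathrm{frac}(r\,w[j])
        \;\le\; (1 - D_{min})\,a + \mathrm{frac}(r\,w[j]), \\
\intertext{so requiring the right-hand side not to exceed $\rho D_{min}$ gives}
\mathrm{frac}(r\,w[j]) &\le \rho D_{min} - (1 - D_{min})\,a \;=\; (\rho + a)\,D_{min} - a .
\end{align*}
```

By symmetry the lower bound is $-(\rho + a)D_{min} + a$. For $r = 8$, $a = 7$ and $\delta = 5$ (so $\rho = 1$ and $D_{min} = 0.96875$) the bound evaluates to $8 \cdot 0.96875 - 7 = 0.75$, which is the magnitude of the lower bound quoted in Section 5.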
4.1.2 Error Detection and Correction 
A prediction error occurs when, after a prediction, the residual falls out of the bounds fixed by equation (11), or equivalently if
$$|w[j+1]| > \rho D_{min}.$$
In [4] it has been demonstrated that when this situation occurs the worst-case interim residual is $\tilde{w}[j+1] \approx 1$. Moreover the smallest convergence interval is such that $|w[j+1]| < \frac{1}{2}$, hence to guarantee the convergence of the algorithm the interim residual must be corrected by $\pm D$ and the speculated digit by $\pm 1$.
As a consequence, taking into account equation (11) and imposing that $(\rho + a)D_{min} - a = \overline{B}$ and that $-(\rho + a)D_{min} + a = \underline{B}$, the selection function becomes:
$$q_{j+1} = \begin{cases} s\,|\mathrm{Int}(r\,w[j])| & \text{if } \underline{B} \le \mathrm{frac}(r\,w[j]) \le \overline{B} \\ s\,|\mathrm{Int}(r\,w[j]) + 1| & \text{otherwise} \end{cases} \qquad (12)$$
where $s$ is such that:
$$s = \mathrm{sign}(r\,w[j]) = \begin{cases} +1 & \text{if } r\,w[j] \ge 0 \\ -1 & \text{if } r\,w[j] < 0 \end{cases} \qquad (13)$$
From equation (11) it is evident that the larger $D_{min}$ (and hence the larger $\delta$), the larger the rate of correct predictions, since this increases the range of the bounds of $\mathrm{frac}(r\,w[j])$.
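The speculate-and-correct iteration can be simulated in software to see which steps need a correction. The sketch below is a behavioural model under explicit assumptions, not the hardware of this paper: it uses full-precision rational residuals (no truncation, no carry-save form, no HRS), it assumes a sign-magnitude split in which the integer part truncates toward zero and the fractional part carries the sign of the residual, so that a correction moves the digit one step away from zero, and it assumes an already scaled divisor with $D_{min} \le D \le 1$. The two's-complement encoding used in the actual selection logic may split the value differently. All function names are hypothetical.

```python
import math
from fractions import Fraction


def speculative_divide(x, d, r, a, delta, n):
    """Full-precision model of digit speculation with error correction."""
    rho = Fraction(a, r - 1)
    d_min = 1 - Fraction(1, 2 ** delta)
    b_hi = (rho + a) * d_min - a          # upper bound on frac(r*w[j])
    b_lo = -b_hi                          # lower bound on frac(r*w[j])
    w, quotient, kinds = Fraction(x), Fraction(0), []
    for j in range(1, n + 1):
        shifted = r * w
        int_part = math.trunc(shifted)    # assumption: truncate toward zero
        frac_part = shifted - int_part    # has the same sign as the residual
        digit = int_part                  # speculated digit
        w_next = shifted - digit * d      # interim residual
        if b_lo <= frac_part <= b_hi:
            kinds.append("fast")          # speculation accepted
        else:
            step = 1 if shifted > 0 else -1
            digit += step                 # correct the digit by +/-1
            w_next -= step * d            # correct the residual by -/+D
            kinds.append("slow")
        quotient += Fraction(digit, r ** j)
        w = w_next
    return quotient, w, kinds


def octal_fraction(digits):
    """Interpret a string of octal digits as the fraction 0.digits in base 8."""
    return Fraction(int(digits, 8), 8 ** len(digits))


x = octal_fraction("6670666")
d = octal_fraction("7644162")
q, w, kinds = speculative_divide(x, d, r=8, a=7, delta=5, n=7)
print(kinds)
print(float(q), float(x / d))
```

The operands are those of the example in Section 5. The first iteration comes out as a corrected (slow) one, and the exact quotient $x/d \approx 0.877288$ agrees with the value reported in Figure 4, although the sequence of digits differs from the paper's because the model has neither truncation nor radix switching.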
4.1.3 Using Residual Estimates 
Until now we have considered full-precision residuals. However, to reduce the adder size and to simplify the selection, we use a truncation $\hat{w}[j]$ to the $t$-th fractional bit of the full-precision redundant residual to select the quotient digit $q_{j+1}$. The truncation introduces an error that affects the upper bound of the convergence interval. Since the maximum truncation error $\varepsilon$ in case of redundant residuals in carry-save form is $2^{-t+1}$, the convergence is assured if
$$-\rho D_{min} \le \hat{w}[j] \le +\rho D_{min} - 2^{-t+1}.$$
Considering this new bounds constraint, in [4] we derived the constraint given by equation (14). Obviously $\delta$ must also satisfy the constraint defined by equation (8). Hence for a radix-$r$ divider we need to evaluate $\log_2 r + t + 1$ bits of the residual in order to speculate and, if necessary, correct a quotient digit. Naturally, $\log_2 r + 1$ is the number of bits of the integer part of the residual. According to these considerations, the selection function with truncated residuals becomes:
$$q_{j+1} = \begin{cases} \hat{s}\,|\mathrm{Int}(r\,\hat{w}[j])| & \text{if } \underline{B} \le \mathrm{frac}(r\,\hat{w}[j]) \le \overline{B} - \varepsilon \\ \hat{s}\,|\mathrm{Int}(r\,\hat{w}[j]) + 1| & \text{otherwise} \end{cases} \qquad (15)$$
where $\hat{s} = \mathrm{sign}(\hat{w}[j])$ and $\varepsilon = 2^{-t+1}$ is the maximum truncation error. Equation (14) fixes a lower bound for $t$. Also in this case the design space must be carefully explored since $t$ affects the rate of correct predictions. In particular, from equation (15) we note that the larger $t$, the higher the rate of correct predictions. However, a large $t$ implies a large adder and hence an increase of both hardware complexity and latency of the implementation, so a tradeoff must be found.
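The constants used by the truncated selection function follow directly from $\rho$, $a$, $\delta$ and $t$. The sketch below (hypothetical function name, exact rational arithmetic) evaluates $D_{min} = 1 - 2^{-\delta}$, the bounds $\pm((\rho + a)D_{min} - a)$ with the upper one reduced by $\varepsilon = 2^{-t+1}$, and the $\log_2 r + t + 1$ examined bits, assuming that $r$ is a power of two.

```python
from fractions import Fraction


def selection_parameters(r, a, delta, t):
    """Bounds of the truncated selection function for a radix-r divider."""
    rho = Fraction(a, r - 1)
    d_min = 1 - Fraction(1, 2 ** delta)
    b_hi = (rho + a) * d_min - a
    eps = Fraction(1, 2 ** (t - 1))       # maximum truncation error 2**(-t+1)
    log2_r = r.bit_length() - 1           # exact only when r is a power of two
    return {
        "D_min": d_min,
        "lower": -b_hi,
        "upper": b_hi - eps,
        "bits": log2_r + t + 1,
    }


params = selection_parameters(r=8, a=7, delta=5, t=4)
print({name: float(value) for name, value in params.items()})
```

With $r = 8$, $a = 7$, $\delta = 5$ and $t = 4$ this yields $D_{min} = 0.96875$ and the interval $[-0.75, 0.625]$, the values used in the example of Section 5, with 8 residual bits examined.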
4.1.4 Simple Scaling 
To compute the scaling factors we use an approach very similar to the one described in [8, 10]. Proofs and implementation details are reported in [4]. We want to multiply the divisor $d$ and the dividend $x$ by a radix-4 scaling factor $M = \sum_{i=0}^{3} m_i 4^{-i}$, such that $m_0 \in \{1, 2\}$, $m_1 \in \{-1, 0, 1, 2\}$, $m_2 \in \{-2, -1, 0, 1, 2\}$ and $m_3 \in \{-2, -1, 0, 1, 2\}$ and that $Md \ge 1 - 2^{-\delta}$. The radix-4 scaling is the one that offers the best tradeoff between performance and ease of implementation [10]. In [4] we showed that the scaling factors $M$ may be computed using relation (16), $n$ being an integer such that $1 \le n \le 2^{\delta-1} - 1$, whereas $\lceil y \rceil_{\delta_s} = 2^{-\delta_s} \lceil 2^{\delta_s} y \rceil$ and $\delta_s = 3 \log_2 4$ (3 being the number of fractional radix-4 digits of the scaling factor). We also proved [4] that the scaling factors, and hence $\delta$, limit the maximum digit value $a$ through the bound of equation (17). Equation (17) is very important because it determines the value $a$ of the maximum digit and hence the redundancy factor $\rho$.
5 Example 
We now give an example of how the proposed algorithm works. For the sake of simplicity we perform a division with $r = 8$, $a = 7$, $x = (0.6670666)_8$ and $d = D = (0.7644162)_8$. From equations (8), (14) and (15) we deduce that $\delta = 5$ and $t = 4$; hence $D_{min} = 0.96875$ and a speculation is correct if $-0.75 \le \mathrm{frac}(r\,\hat{w}[j]) \le 0.625$. If $-0.1875 \le \hat{w}[j] \le 0.0625$ a change to radix 32 is performed, whereas if $-0.4375 \le \hat{w}[j] < -0.1875$ or $0.0625 < \hat{w}[j] \le 0.3125$ the algorithm switches to radix 16. The higher radices are determined by equation (4). Since according to this equation the higher the radix, the smaller the bounds, to obtain an implementation with a switching [...] adder used for quotient-digit speculation is implemented by cascading two smaller adders that compute the integer and the fractional part of the partial residual in carry-assimilated [...] clear that the algorithm has a variable execution time. This is due to the three different iteration types that determine the latency of an iteration and to the HRS. In fact, the possibility to switch to a higher radix makes it possible to increase the number of bits computed per iteration (as for iterations 2 and 6), reducing the overall number of iterations necessary to complete a division and achieving an asynchronous behaviour. In the example described in this section, the division ends after 6 iterations instead of 7.

Figure 4. Example of Radix-8 Division with HRS. Selection bounds: $-0.75 \le \mathrm{frac}(r\,\hat{w}[j]) \le 0.625$; $p$ is the number of bits of advance.
Start: $w[0] = (0.6670666)_8$, $q_0 = 0$, $p = 0$.
Iteration 1 (SLOW): $2^p r w[0] = (6.670666)_8$, $r\hat{w}[0] = 6.8125$, $q_1 = \mathrm{Int}(r\hat{w}[0]) + 1 = 7$, $-q_1 D = (-6.6575436)_8$, $w[1] = (0.0111222)_8$, $p = 2$.
Iteration 2 (FAST): $2^p r w[1] = (0.44511)_8$, $r\hat{w}[1] = 0.5$, $q_2 = \mathrm{Int}(r\hat{w}[1]) = 0$, $-q_2 D = (-0.0)_8$, $w[2] = (0.44511)_8$, $p = 0$.
Iteration 3 (FAST): $2^p r w[2] = (4.4511)_8$, $r\hat{w}[2] = 4.5625$, $q_3 = \mathrm{Int}(r\hat{w}[2]) = 4$, $-q_3 D = (-3.722071)_8$, $w[3] = (0.527007)_8$, $p = 0$.
Iteration 4 (FAST): $2^p r w[3] = (5.27007)_8$, $r\hat{w}[3] = 5.3125$, $q_4 = \mathrm{Int}(r\hat{w}[3]) = 5$, $-q_4 D = (-4.7065072)_8$, $w[4] = (0.3613606)_8$, $p = 0$.
Iteration 5 (SLOW): $2^p r w[4] = (3.613606)_8$, $r\hat{w}[4] = 3.75$, $q_5 = \mathrm{Int}(r\hat{w}[4]) + 1 = 4$, $-q_5 D = (-3.722071)_8$, $w[5] = (-0.106263)_8$, $p = 2$.
Iteration 6 (FAST): $2^p r w[5] = (-4.31314)_8$, $r\hat{w}[5] = -4.4375$, $q_6 = -(|\mathrm{Int}(r\hat{w}[5]) + 1|) = -4$, $-q_6 D = (3.7220710)_8$, $w[6] = -(0.371047)_8$, $p = 1$.
Result: $Q = (0.704544)_8 = 0.877288$.
6 Evaluation
To evaluate the performance of the schemes that implement the algorithm we have described, we used a standard cell family and designed two double-precision floating-point dividers (the width of the datapath affects only the area estimates). Since a design depends on many parameters, we have not done a complete analysis of the solution space, but performed some reasonable designs to evaluate and compare.
In Section 1, we have dealt with one of the major concerns 
when designing a recurrence divider, that is the choice of 
the radix. A high radix may lead to improved performance only if the increase of the iteration time is kept as small as possible. As a consequence a tradeoff between radix and cycle time must be found, since using a high radix does not necessarily lead to improved speedups. For this reason the performance of the different algorithms reported in Table 2 is analysed for several radices, in order to identify the solution that best matches the design constraints.
In particular, in order to compare our designs with the ones described in [10, 5], we have used the same 1-µm standard cell CMOS library and the same design tools. Delays and areas are expressed as multiples of the delay and the area of a two-input NAND gate with a fanout of three NAND gates (the size of a two-input NAND gate is 12.5 × 47.5 µm², the delay of the unloaded gate is 0.24 ns). Some simple modules have been designed by hand (multiplexers and CSAs), whereas SIS [17] has been used for the synthesis of the decoders and error detection function. SIS has always been guided to optimize delay at the expense of increasing the area. Fan-in and fan-out capacitances (but not routing) have been considered for delay calculations. In Table 2, we report the final results. In our estimations we also take into account the extra iterations necessary for operand scaling. We consider the conventional dividers described in [10] and the speculative dividers both with partial advance and no advance reported in [5] and compare these implementations with the speculative asynchronous dividers that realize the proposed algorithm with HRS. We compare all the implementations both in terms of speedup and area and the results are normalized with respect to the conventional radix-2 case.
All the measures performed are summarized in Figure 7. To 
Table 3. Area of Basic Block of a Radix-2 Self-Timed Stage [19].
Module                  Area
2 × 3b CSA              40
3 × 3b CPA              66
1 × 54b 3-input MUX     240
1 × 54b CSA             [illegible]
OSEL                    [illegible]
total                   875
6.1 Comparison with Other Asynchronous Design 
Methodologies 
Although the designs described in [19, 13, 121 are 
pipelined and use differential DCVSL gates and a dual-rail 
handshake for stage synchronization, it would be interesting 
to perform some rough performance and area estimations in 
order to compare these design methodologies with the one 
proposed in this paper. Since the designs reported in [13, 12]
are shared units for division and square root we prefer to 
compare our implementation only with [19]. 
Table 3 reports normalized delays and area of the basic blocks that form a stage of the self-timed loop. In [19], an analytical expression was derived to compute the average case delay of each stage of the pipeline, which according to our estimations is about 11.5 2-input NAND delays. Table 4 reports the estimated average delay and area occupation of the schemes proposed in [19]. Hence, according to our estimations, the pipelined divider with self-timed loop has a speedup of 1.7 and an area factor of 1.6 with respect to a conventional radix-2 synchronous divider, but is still uncompetitive with the design proposed in this paper, for which we estimated a speedup of 3.1 with respect to a conventional radix-2 synchronous divider.
Table 4.
Module                  Area    Delay
5 × Div. Stage          4375    11.5
1 × 54b 2-input MUX     160     1.4
2 × 54b register        430     not critical
total                   4965    648
Figure 5. Block Diagram for a Speculative Radix-16 Divider with HRS.
7 A Design Example 
This section describes the implementation of a radix-16 divider with HRS. Further details, as well as the description of a radix-64 unit for division, may be found in [4].
7.1 Radix-16 with HRS 
We preferred a bundled-data approach basically for two reasons. The first is to limit the area occupation of the implementations, thus achieving better area × delay products. The second is the need to compare the proposed design approach with the ones reported in [10, 5] using objective criteria. We think that further investigation is necessary to study the impact of a dual-rail implementation on the performance of the proposed algorithm, since the increased wiring load typical of such a realization may not necessarily lead to a better throughput. The block diagram of the proposed implementation is shown in Figure 5, and the characteristics are shown in Table 2. The quotient digit speculated by the selection function is $q_h + q_l$, where $q_h \in \{-8, -4, 0, +4, +8\}$ and $q_l \in \{-1, 0, +1\}$. Therefore we choose $a = 9$ since, according to equation (17) and to the design parameters of Table 2, $a \le 10$. Moreover, in this way we avoid the generation of $q_h + q_l = 11$, which would imply the use of three carry-save adders, increasing both iteration time and hardware complexity. Table 5 reports area and delay characteristics of this design. Since $q_h$ is the highest-weight component of the quotient digit, the decoding logic DEC is simpler and faster than for $q_l$. During the synthesis of the decoder DEC, SIS has been guided to reduce the delay of $q_h$ since it is on the critical path.
Figure 6. Block Diagram of the Selection Logic.
Table 5. Area and Delay of the Radix-16 Scheme with HRS.
Module              Area          Delay
speculation         [illegible]   10.4/6.4* (for q_h), 11.5/7.5 (for q_l)
error detection     [illegible]   15.5/11.5
mux (for q_j d)     2 × 302       1.8*
mux (scaling)       1650          1.4*, 1.6 or 1.8
mux (err. det.)     12            1.4
mux (HRS)           230           1.6*
csa                 2 × 360       4.2* (from r w[j]), 2.2* (from q_j d)
total               5870          27.2
Entries for the buffers, HRS, convert, registers and switched delays rows, in source order: 8 × 2.6, 1.8*; 26, not critical; 71, not critical; 170 × 3, 7.8*; 1900.
Note that two modules are required: one for quotient digit speculation and one for switching to another radix. The latter controls the amount of the shift at the end of an iteration and permits an advance of one bit in case of switching to radix 32 or two bits in case of switching to radix 64.
These two modules are conceptually identical, but evaluate different ranges of $w[j]$. The CSA has been designed as a radix-2 full adder and its worst-case delay is determined by two cascaded XOR gates. However, the outputs of the $q_j d$ multiplexers have been connected to the last gate in order to reduce the critical path delay (the same optimization has been used in the other designs). This approach may not be used for the residual since it is represented in redundant form and requires two signals. Moreover, to save hardware, we use the divider carry-save chain to compute the scaling factors as well; this implies the insertion of multiplexers to switch between scaling and division iteration.
As regards the computation of the delays that must be matched by the switched-delay chain, we assume that all the gates along the three possible delay paths have a propagation delay equal to the worst-case one. It must be remarked that, in order to compare our designs with [10, 5], in our performance analysis we neglect the delay due to wiring. This means that we give only a rough estimation of the real delays.
We think that more realistic estimations may be performed by assuming a draft layout plan in order to compute the worst-case load due to wires, which, especially in the case of components with large fan-out such as the buffers that drive the $q_j d$ multiplexers, may lead to significant delays.
7.1.1 Selection Function 
The block diagram is shown in Figure 6. The integer and the 
fractional part of rŵ[j] are assimilated by two fast adders, 
so that the carry-out signal CO may be used jointly with 
the error signal e to select the iteration type, as shown 
in Figure 6. In fact, when CO is “0”, digit q_{j+1} is generated 
after the delay of one adder, whereas when CO is “1”, q_{j+1} 
is generated after the delay of two adders. 
Digit q_j is represented in two's complement on log_2 r + 1 
bits, whereas digit q_{j+1} ∈ {-1, +1} is encoded on two 
bits (each line encodes one of the possible values). 
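The role of the two control signals can be illustrated with a minimal behavioural sketch. The mapping used here is inferred from the text rather than taken from Figure 6: it assumes that e = 1 flags a wrong speculation and forces the slow (correcting) iteration, that CO = 0 corresponds to the fast iteration (one adder delay) and that CO = 1 corresponds to the medium one (two adder delays). The names IterationType and select_iteration are hypothetical.

```python
from enum import Enum


class IterationType(Enum):
    FAST = "fast"      # digit available after one adder delay
    MEDIUM = "medium"  # digit available after two adder delays
    SLOW = "slow"      # speculation failed, a correction is required


def select_iteration(co: int, e: int) -> IterationType:
    """Choose the iteration type from the carry-out CO and the error signal e.

    Assumption (not stated in the paper): e takes priority over CO,
    because a correction is needed whenever the speculated digit is wrong.
    """
    if e:
        return IterationType.SLOW
    return IterationType.MEDIUM if co else IterationType.FAST


# Enumerate all combinations of the two control signals.
for co in (0, 1):
    for e in (0, 1):
        print(f"CO={co} e={e} -> {select_iteration(co, e).value}")
```

In a bundled-data implementation the returned value would choose which tap of the switched-delay chain generates the completion signal; here it is only a label.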
8 Summary and Conclusions 
The division method presented in this paper 
is based on the speculation of the result digit. Speculation 
is performed by using an adder; this results in 
implementations that are faster than conventional ones and that 
require fewer hardware resources than the other speculative 
approaches analyzed in this paper. The selection function 
makes it possible to switch between three kinds of iterations: 
a slow, a medium and a fast one. The slow iteration is the one 
that is performed when a correction has to be carried out. The 
performance of the algorithm may be enhanced by using HRS, 
namely switching to a higher radix when the partial residual 
falls within certain bounds. HRS yields implementations 
with a variable number of iterations. We carried out 
several designs using the same technology and determined 
their relative speed and area. The results are summarized in 
Figure 7. 
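The speculate-and-correct principle summarized above can be illustrated with a small behavioural model. This sketch is not the selection function of this paper: it assumes exact rational arithmetic, a divisor d prescaled close to 1, speculation by rounding the shifted residual as if d were exactly 1, and a correction step that restores |w| <= d/2. It does not model HRS. The function name and the returned correction count are hypothetical.

```python
from fractions import Fraction


def speculative_divide(x: Fraction, d: Fraction, r: int, n: int):
    """Radix-r digit recurrence w[j+1] = r*w[j] - q[j+1]*d with speculation.

    Assumes 0 <= x < d and d > 0. Returns (quotient, corrections), where
    corrections counts how many times a speculated digit had to be fixed.
    """
    w = x
    quotient = Fraction(0)
    corrections = 0
    for j in range(1, n + 1):
        shifted = r * w
        q = round(shifted)        # speculated digit (pretends d == 1)
        w = shifted - q * d       # residual update with the speculated digit
        # Correction (the "slow" iteration): restore |w| <= d/2.
        while w > d / 2:
            q, w = q + 1, w - d
            corrections += 1
        while w < -d / 2:
            q, w = q - 1, w + d
            corrections += 1
        quotient += Fraction(q, r ** j)
    return quotient, corrections


x, d, r, n = Fraction(3, 8), Fraction(15, 16), 8, 6
q, fixes = speculative_divide(x, d, r, n)
# The recurrence keeps x = d*q + w*r**(-n), so the error is at most r**(-n)/2.
assert abs(x / d - q) <= Fraction(1, 2 * r ** n)
print(float(q), fixes)
```

The closer d is to 1, the more often the speculated digit is already correct, which is what makes the average iteration time data dependent.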
From the analysis of the experimental results reported in 
Figure 7 we may infer that the proposed approach (HRS) 
has, in the case of radix 64, a latency comparable with that 
of the speculative radix-512 divider with partial advance 
(denoted by “p.a.”) described in [5], but with an area saving 
of about 35%. In addition, as shown in Section 6, the schemes 
that implement the proposed algorithm have a delay x area 
saving of about 65% with respect to the conventional 
radix-512 unit and of about 25% with respect to the speculative 
radix-512 unit with partial advance. They also outperform 
other known asynchronous designs in efficiency. 
[Plot: implementations labeled by radix (e.g. r8, r64, r512) versus 
Area Factor (horizontal axis, 1.0 to 4.5). Legend: conventional, 
speculative, speculative (p.a.), speculative (HRS).] 
Figure 7. Summary of the Implementations for Different Dividers. 
References 
[1] D. E. Atkins. Design of the Arithmetic Unit of Illiac III: Use 
of Redundancy and Higher Radix Methods. IEEE Transactions 
on Computers, C-19(8):720-733, August 1970. 
[2] J. A. Brzozowski and C.-J. H. Seger. Asynchronous Circuits. 
Springer-Verlag, New York, 1994. 
[3] B. Chappell. The Fine Art of IC Design. IEEE Spectrum, 
36(7):30-34, July 1999. 
[4] G. Cornetta and J. Cortadella. Multi-Radix Asynchronous 
Dividers. Technical Report UPC-DAC-2000-57, Universitat 
Politècnica de Catalunya, Departament d'Arquitectura de 
Computadors, September 2000. 
[5] J. Cortadella and T. Lang. High-Radix Division and Square 
Root with Speculation. IEEE Transactions on Computers, 
C-43(8):919-931, August 1994. 
[6] M. D. Ercegovac and T. Lang. On the Fly Conversion of 
Redundant into Conventional Representations. IEEE Transactions 
on Computers, C-36(7):895-897, July 1987. 
[7] M. D. Ercegovac and T. Lang. Fast Radix-2 Division with 
Quotient Digit Prediction. Journal of VLSI Signal Processing, 
2(1):169-180, January 1989. 
[8] M. D. Ercegovac and T. Lang. Simple Radix-4 Division 
with Operands Scaling. IEEE Transactions on Computers, 
C-39(9):1204-1207, September 1990. 
[9] M. D. Ercegovac, T. Lang, and P. Montuschi. Very-High 
Radix Division with Prescaling and Selection by Rounding. 
IEEE Transactions on Computers, C-43(8):909-918, August 
1994. 
[10] M. D. Ercegovac and T. Lang. Division and Square Root: 
Digit-Recurrence Algorithms and Implementations. Kluwer 
Academic Publishers, Norwell, MA, 1994. 
[11] D. Kearney and N. W. Bergmann. Bundled Data Asynchronous 
Multipliers with Data Dependent Computation 
Times. In 3rd IEEE Symposium on Advanced Research in 
Asynchronous Circuits and Systems, pages 186-197, 1997. 
[12] G. Matsubara and N. Ide. A Low Power Zero-Overhead Self-Timed 
Division and Square Root Unit Combining a Single-Rail 
Static Circuit with a Dual-Rail Dynamic Circuit. In 
Symposium on Asynchronous Design Methodologies, pages 
198-209, 1997. 
[13] G. Matsubara, N. Ide, et al. 30-ns 55-b Radix-2 Division and 
Square Root Using a Self-Timed Circuit. In 12th Symposium 
on Computer Arithmetic, pages 198-209, 1995. 
[14] P. Montuschi and T. Lang. Boosting Very High-Radix Division 
with Prescaling and Selection By Rounding. In 14th 
IEEE Symposium on Computer Arithmetic, pages 52-59, 
1999. 
[15] S. M. Nowick. Design of a Low-Latency Asynchronous 
Adder Using Speculative Completion. IEE Proceedings on 
Computer Digital Techniques, 143(5):301-307, September 
1996. 
[16] S. M. Nowick, K. Y. Yun, P. A. Beerel, and A. E. Dooply. 
Speculative Completion for the Design of High-Performance 
Asynchronous Dynamic Adders. In Symposium on Asynchronous 
Design Methodologies, pages 210-223, 1997. 
[17] E. M. Sentovich, K. J. Singh, L. Lavagno, et al. SIS: A 
System for Sequential Circuit Synthesis. Technical report, 
Electronics Research Laboratory, EECS Department, University 
of California, Berkeley, May 1992. 
[18] G. S. Taylor. Radix-16 SRT Dividers with Overlapped Quotient 
Selection Stages. In 7th IEEE Symposium on Computer 
Arithmetic, pages 64-71, 1985. 
[19] T. E. Williams and M. Horowitz. A 160ns 54-bit CMOS 
Division Implementation Using Self-Timing and Symmetrically 
Overlapped SRT Stages. In 10th IEEE Symposium on 
Computer Arithmetic, pages 210-217, 1991. 
