A radix-16 SRT division unit with speculation of the quotient digits by Gianluca, Cornetta & Cortadella, Jordi
A Radix-16 SRT Division Unit with Speculation of the Quotient Digits 
Gianluca Cornetta
Computer Architecture Dept.
Universitat Politècnica de Catalunya
08034 Barcelona—Spain
E-mail: cornetta@ac.upc.es
Jordi Cortadella
Software Dept.
Universitat Politècnica de Catalunya
08034 Barcelona—Spain
E-mail: jordic@lsi.upc.es
Abstract
The speed of a divider based on a digit-recurrence algorithm depends mainly on the latency of the quotient digit generation function. In this paper we present an analytical approach that extends the theory developed for standard SRT division and makes it possible to implement division schemes in which a simpler function speculates the quotient digit. This leads to division units with a shorter cycle time and variable latency, since a speculation error may occur and a post-correction of the quotient may be necessary. We have applied our algorithm to the design of a radix-16 speculative divider for double-precision floating-point numbers, which proved to be faster than analogous implementations.
1. Introduction
Most VLSI implementations of the division operation are based on digit recurrence; this class of algorithms, also known as SRT, offers a good tradeoff between implementation cost and performance. It is an iterative algorithm with linear convergence, i.e. a digit $q_j$ of the result is calculated at each iteration. The quotient after $k$ iterations is $Q[k] = \sum_{j=1}^{k} q_j r^{-j}$, $r$ being the radix. Several techniques proposed to improve the execution time of the algorithm, including prediction, operand prescaling and overlapping of several selection stages, are reported in the book of Ercegovac and Lang [4]. Speculation of quotient digits, as a technique to speed up the division algorithm, was first reported in [2] and then extended to high-radix division and square root in [3]. The main differences between the method proposed in [2, 3] and the one we have developed are the following:
1. The number of bits for the truncation of the divisor and partial remainder to calculate the quotient digit is analytically derived;
2. The estimations of $w$ and $d$ to be used for speculation, error detection and correction are computed as a function of the desired speculation error;
3. The correction is always carried out in one extra cycle, whatever the error.

This work has been funded by the Ministry of Education of Spain under CICYT, TIC 98-0410.
The rest of the paper is structured as follows: in Section 2 we review the SRT algorithm and introduce the notation that we will use in this paper. Section 3 describes the proposed algorithm. Section 4 applies the theory developed in the previous section to the design of a radix-16 divider and compares its performance with that of previous implementations. Finally, in Section 5, we draw some conclusions.
2. SRT Division and Notation
In an SRT division scheme, the following recurrence is applied repeatedly:

$$w[j+1] = r\,w[j] - q_{j+1}\,d, \qquad w[0] = x \tag{1}$$

where $w[j]$ is the residual after the $j$-th iteration, $q_{j+1}$ the new quotient digit, $d$ the divisor and $x$ the dividend. We also assume that $x$ and $d$ are positive normalized fractions. The quotient digit is a signed digit with $|q_{j+1}| \le a$ and redundancy factor $\rho = \frac{a}{r-1}$; in particular, we will consider digit sets with $\rho \le 1$. Furthermore, in order to guarantee the convergence of the algorithm, the residual must be bounded, that is:

$$-\rho d \le w[j] \le +\rho d \tag{2}$$
The quotient digit $q_{j+1}$ is determined by a digit selection function $F$ whose arguments are estimations $\hat{w}[j]$ and $\hat{d}$ of the residual and the divisor; that is:

$$q_{j+1} = F(\hat{w}[j], \hat{d}) \tag{3}$$
As described in [4], one of the methods for quotient digit selection is essentially a bound-checking operation, i.e. a comparison of $\hat{w}[j]$ with a set of selection constants $m_k$ that may depend on $d$ and such that $L_k \le m_k \le U_{k-1}$, where

$$L_k = (-\rho + k)\,d \qquad U_k = (\rho + k)\,d$$

are respectively the lower and the upper bound of the selection interval for digit $q_{j+1} = k$. A selection function must satisfy two fundamental conditions [4], containment and continuity, which determine the range of the selection intervals. However, a speculative selection function need only satisfy the latter, since a correction is performed every time the residual exceeds the correct bounds.
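As an illustration of recurrence (1) and bound (2), the following is a minimal sketch in exact rational arithmetic. It is not the selection function of the paper: the digit is chosen by an idealised full-precision rounding of $r\,w[j]/d$, the function name `srt_divide` and its parameters are hypothetical, and the dividend is assumed to satisfy bound (2) already at $j = 0$.

```python
from fractions import Fraction

def srt_divide(x, d, r=16, a=12, steps=14):
    """Digit-recurrence division following recurrence (1).

    Returns (Q, w, digits) such that x == Q*d + w * r**(-steps) exactly.
    """
    rho = Fraction(a, r - 1)              # redundancy factor a / (r - 1)
    x, d = Fraction(x), Fraction(d)
    if abs(x) > rho * d:
        raise ValueError("w[0] = x must already satisfy bound (2)")
    w, quotient, digits = x, Fraction(0), []
    for j in range(1, steps + 1):
        shifted = r * w
        # Idealised selection (an assumption of this sketch): nearest integer
        # to r*w/d, clamped to the digit set {-a, ..., +a}.
        digit = max(-a, min(a, round(shifted / d)))
        w = shifted - digit * d           # recurrence (1)
        assert abs(w) <= rho * d          # bound (2) holds after every step
        digits.append(digit)
        quotient += Fraction(digit, r ** j)
    return quotient, w, digits

q, w, digits = srt_divide(Fraction(1, 2), Fraction(3, 4))
print(float(q), digits)
```

Because the residual stays within bound (2), the quotient error after $k$ steps is at most $\rho\, r^{-k}$, which is what the invariant inside the loop checks indirectly.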
3. Quotient Digit Speculation
In the case of speculation of the quotient digit, the recurrence for division becomes [3]:

$$w^s[j+1] = r\,w[j] - q^s_{j+1}\,d \tag{4}$$

where $q^s_{j+1}$ is the speculated digit and $w^s[j+1]$ the speculated partial remainder. If the speculation is correct, $w[j+1] = w^s[j+1]$ and $q_{j+1} = q^s_{j+1}$. In the case of a wrong prediction, both the partial remainder and the speculated quotient digit have to be corrected. In particular, the correct digit $q_{j+1} \in \{q^s_{j+1}-\gamma, \ldots, q^s_{j+1}-1, q^s_{j+1}, q^s_{j+1}+1, \ldots, q^s_{j+1}+\gamma\}$ (with $\gamma \ge 1$) is:

$$q_{j+1} = q^s_{j+1} + q^c_{j+1} \tag{5}$$

where $q^c_{j+1} \in \{-\gamma, \ldots, -1, 0, +1, \ldots, +\gamma\}$ is the correction digit. The wrongly speculated remainder may be corrected using the following relation:

$$w[j+1] = w^s[j+1] - q^c_{j+1} \cdot d \tag{6}$$
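The speculate, detect and correct sequence of relations (4) to (6) can be sketched as follows. This is a simplified model under stated assumptions, not the circuit described in this paper: the speculation uses a crude rounding of truncated operands in place of a real selection table, the error detection uses the full-precision residual, the correction magnitude is not limited, and all names (`truncate`, `speculative_step`, `divide`, `w_bits`, `d_bits`) are hypothetical. The dividend is assumed to satisfy the residual bound (2) at the first step.

```python
from fractions import Fraction
import math

R, A = 16, 12                      # radix and maximum digit magnitude
RHO = Fraction(A, R - 1)           # redundancy factor

def truncate(value, frac_bits):
    """Drop everything below 2**-frac_bits (round toward minus infinity)."""
    scale = 1 << frac_bits
    return Fraction(math.floor(value * scale), scale)

def speculative_step(w, d, w_bits=2, d_bits=3):
    """Return (digit, correction, next_residual) for one iteration."""
    # Speculate from coarse estimates; a stand-in for a real selection table.
    q_s = round(truncate(R * w, w_bits) / truncate(d, d_bits))
    q_s = max(-A, min(A, q_s))
    w_s = R * w - q_s * d                       # speculated residual, eq. (4)
    q_c = 0
    if abs(w_s) > RHO * d:                      # speculation error detected
        q = max(-A, min(A, q_s + round(w_s / d)))
        q_c = q - q_s                           # correction digit, eq. (5)
    return q_s + q_c, q_c, w_s - q_c * d        # corrected residual, eq. (6)

def divide(x, d, steps=14):
    """Run `steps` iterations and count how many needed a correction."""
    w, quotient, corrections = Fraction(x), Fraction(0), 0
    for j in range(1, steps + 1):
        digit, q_c, w = speculative_step(w, Fraction(d))
        if q_c != 0:
            corrections += 1
        quotient += Fraction(digit, R ** j)
    return quotient, w, corrections

q, w, corrections = divide(Fraction(5, 8), Fraction(127, 128))
print(float(q), corrections)
```

Coarser estimates (smaller `w_bits` or `d_bits`) make the speculation cheaper but increase the number of corrections, which is the tradeoff the analysis in this section quantifies.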
The selection bounds for digit $q^s_{j+1} = k$ are:

$$L^s_k = (-\rho + k - \gamma)\,d \qquad U^s_k = (\rho + k + \gamma)\,d \tag{7}$$

where $\varepsilon = |w[j+1] - w^s[j+1]| = \gamma d$ is the maximum speculation error and is a function of $d$. Since $L^s_k < L_k$ and $U^s_k > U_k$, this results in a larger overlap between two consecutive selection intervals. Consequently the selection function will be simpler. The overlap between the intervals may be increased for values of $d$ close to $\frac{1}{2}$ by considering the maximum speculation error for $d = 1$ (namely $\varepsilon = \gamma$). Digit $q^s_{j+1}$ is determined by a function $F^s$ that evaluates an estimation of both the full-precision residual and divisor, that is:

$$q^s_{j+1} = F^s(w'[j], d') \tag{8}$$
If $t$ is the number of bits of the fractional part of $\hat{w}[j]$, then the estimation $w'[j]$ of the residual at the $j$-th iteration may be obtained from $\hat{w}[j]$ by discarding the $\tau$ least significant bits of its fractional part. To determine $\tau$ we impose that the overlap (for the worst case $d = \frac{1}{2}$) between two adjacent selection intervals must be greater than or equal to the truncation error of the carry-sum redundant representation of $w'[j]$, namely:

$$(U^s_k - L^s_{k+1})_{d=1/2} = (\rho + \gamma) - \frac{1}{2} \ge 2^{\tau - t + 1} \tag{9}$$
A residual in carry-save form is represented by two bit vectors; the former stores the sum bits, whereas the latter stores the carry bits of an addition. Imposing the continuity condition between two consecutive selection intervals ($L^s_k(d_i + 2^{-\delta}) \le U^s_{k-1}(d_i)$), we are able to find the number $\delta$ of bits of the estimation $d'$ of the divisor, which leads to:

$$\left(-\rho + \frac{1}{2}\right) + (-\rho + a - \gamma)\,2^{-\delta} \le \gamma \tag{10}$$
The term $d_i$ represents the discretized divisor; namely $d_{i+1} = d_i + 2^{-\delta}$ with $d_0 = \frac{1}{2}$ and $0 \le i < 2^{\delta-1}$.
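Condition (10) can be solved numerically for the smallest admissible divisor precision. The sketch below assumes the reading of (10) spelled out in the code comment, with `rho` the redundancy factor, `a` the maximum digit, `gamma` the maximum correction magnitude and `delta` the number of fractional bits of the divisor estimate; the function name is hypothetical.

```python
from fractions import Fraction

def min_divisor_bits(rho, a, gamma, max_bits=32):
    """Smallest delta with (1/2 - rho) + (a - rho - gamma) * 2**-delta <= gamma."""
    for delta in range(1, max_bits + 1):
        lhs = (Fraction(1, 2) - rho) + (a - rho - gamma) * Fraction(1, 2 ** delta)
        if lhs <= gamma:
            return delta
    raise ValueError("no admissible delta up to max_bits")

# Radix 16 with a = 12 (rho = 12/15) and a maximum correction of one digit.
print(min_divisor_bits(Fraction(12, 15), 12, 1))  # 3
```

Since a normalized divisor always has its first fractional bit set, a step of $2^{-3}$ leaves two bits of the estimate that actually vary.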
3.1. Error Detection and Correction
A speculation is correct if and only if, at each step $j$ of the recurrence, the speculated partial remainder $w^s[j]$ is bounded; that is:

$$-\rho d \le w^s[j] \le \rho d \tag{11}$$

Thus, in order to verify the correctness of a prediction, we may check whether the partial remainder generated by a speculation falls within the correct bounds expressed by (11). Furthermore, the correction function has to guarantee that the sufficient condition for convergence (2) is always respected. Since a wrongly speculated partial remainder exceeds the correct bounds expressed by (11), its magnitude may be greater than unity; as a consequence, $\iota$ integer bits of the remainder have to be evaluated in order to detect a speculation error. As $\varepsilon = \gamma d$, the speculated remainder lies within the following bounds:

$$-(\rho + \gamma)\,d \le w^s[j] \le (\rho + \gamma)\,d \tag{12}$$

Since $\rho \le 1$ and $d < 1$, considering the worst case we obtain:

$$\iota = \lceil \log_2(1 + \gamma) \rceil + 1, \qquad \gamma \le r - 1 \tag{13}$$

which also takes the sign bit into account.
The speed of the comparison may be increased by considering the estimations $w''[j]$ and $d''$ of both $w^s[j]$ and $d$. Namely, $w''[j] = w^s[j]_{2^{-e+1}}$ and $d'' = d_{2^{-f}}$, which is equivalent to truncating the carry-sum representation of $w^s[j]$ to the $e$-th fractional bit and $d$ to the $f$-th fractional bit; thus we obtain:

$$-\rho d'' \le w''[j] \le +\rho d'' - 2^{-e+1} \tag{14}$$

If $w''[j]$ is out of bounds and close to one of the bounds of (14) (for example the lower bound), we must impose that the correction by $q^c_j = -1$ guarantees convergence, that is:

$$-\rho d'' + 1 \cdot d \le +\rho d'' - 2^{-e+1}$$

from which it follows that the amount of truncation is given by the following relation:

$$\left(\rho - \frac{1}{2}\right) \ge 2^{-e+1} + 2^{-f} \tag{15}$$
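The two sizing relations of this subsection, (13) for the integer bits and (15) for the fractional truncation, are easy to evaluate for concrete parameters. The sketch below assumes the readings stated in the docstrings, with `rho` the redundancy factor and `gamma` the maximum correction magnitude; the function names are hypothetical.

```python
from fractions import Fraction
import math

def integer_bits(gamma):
    """Integer bits (sign included) per (13): ceil(log2(1 + gamma)) + 1."""
    return math.ceil(math.log2(1 + gamma)) + 1

def truncation_ok(rho, e, f):
    """Check (15): rho - 1/2 >= 2**(-e+1) + 2**(-f)."""
    return rho - Fraction(1, 2) >= Fraction(2) ** (-e + 1) + Fraction(2) ** (-f)

def smallest_f(rho, e, max_bits=32):
    """Smallest f that satisfies (15) for a given e."""
    for f in range(1, max_bits + 1):
        if truncation_ok(rho, e, f):
            return f
    raise ValueError("no admissible f for this e")

rho = Fraction(12, 15)
print(integer_bits(1))        # 2
print(smallest_f(rho, e=4))   # 3
```

For $\rho = 12/15$ the left-hand side of (15) is only $0.3$, so the residual and divisor estimates share a small error budget: with $e = 4$ the residual consumes $0.125$ of it and the divisor needs at least three fractional bits.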
Expression (14) restricts the range of the allowed residuals expressed by (11). Being conservative, it leads to some unnecessary corrections; moreover, in order to guarantee a single correction cycle, we have to extend the correction digit set by imposing $\hat{q}^c_{j+1} = q^c_{j+1} + \lambda$ with $\lambda \in \{-1, 0, +1\}$. The new correction digit $\hat{q}^c_{j+1}$ is generated by an error detection and correction function $F^c$ such that:

$$\hat{q}^c_{j+1} = F^c(w''[j], d'') \tag{16}$$
Simulations have shown that the best performance is obtained for $\gamma = 1$, which leads to $\hat{q}^c_{j+1} \in \{-2, -1, 0, +1, +2\}$. As a consequence, function $F^c$ is:

$$\hat{q}^c_{j+1} = \begin{cases} -2 & \text{if } w''[j] \le \underline{\Phi} \\ -1 & \text{if } \underline{\Phi} < w''[j] < -\rho d'' \\ +1 & \text{if } \rho d'' - 2^{-e+1} < w''[j] < +\overline{\Phi} \\ +2 & \text{if } w''[j] \ge \overline{\Phi} \end{cases} \tag{17}$$

The lower bound $\underline{\Phi}$ and the upper bound $\overline{\Phi}$ may be calculated in a straightforward way. To compute these bounds we have to impose that a correction by $\pm 2d$ produces a bounded residual. This means that after a correction, the residual $w[j]$ is such that $|w[j]| \le \rho d''$. Thus we obtain:

$$\underline{\Phi} = (\rho - 2)\,d'' - 2^{-f+1} - 2^{-e+1} \tag{18}$$

and

$$\overline{\Phi} = (2 - \rho)\,d'' - 2^{-f+1} \tag{19}$$
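The case analysis of the correction function (17) maps directly onto a chain of comparisons. The sketch below is illustrative only: the two outer thresholds are passed in as parameters rather than computed, `phi_low` is assumed to be a signed (negative) value, residuals inside the bounds of (14) are assumed to need no correction, and the function and argument names are hypothetical.

```python
from fractions import Fraction

def correction_digit(w2, d2, rho, e, phi_low, phi_high):
    """Correction digit for the truncated residual w2 and truncated divisor d2.

    A positive result means the speculated residual was too large, because the
    corrected residual is obtained by subtracting the correction times d.
    """
    upper_ok = rho * d2 - Fraction(1, 2 ** (e - 1))   # right-hand bound of (14)
    if w2 <= phi_low:
        return -2
    if w2 < -rho * d2:
        return -1
    if w2 >= phi_high:
        return 2
    if w2 > upper_ok:
        return 1
    return 0                                          # within (14): no correction

# Hypothetical thresholds, chosen only to exercise every branch.
rho, d2, e = Fraction(4, 5), Fraction(1, 2), 4
for w2 in (Fraction(-1), Fraction(-1, 2), Fraction(0), Fraction(3, 10), Fraction(1)):
    print(w2, correction_digit(w2, d2, rho, e, Fraction(-1), Fraction(1)))
```

Only the sign and a few leading bits of the residual take part in these comparisons, which is why the detection can run on a short truncated estimate.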
3.2. Performance Evaluation
Performance evaluation of a design is based on the calculation of the average number of cycles per quotient digit ($C_d$) and of the delay per quotient bit ($D_b$). Let $N_d$ be the number of divisions simulated, $N_c$ the number of correction cycles that have been performed and $m$ the size of the mantissa (in bits); the number of cycles per quotient digit may be defined as [3]:

$$C_d = 1 + \frac{N_c}{N_d \cdot \lceil m / \log_2 r \rceil} \tag{20}$$

If $D$ is the cycle delay of the design, the delay per bit may be defined as follows:

$$D_b = \frac{C_d\, D}{\log_2 r} \tag{21}$$
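The two metrics (20) and (21) are simple enough to evaluate directly. In the sketch below the function names are hypothetical, and the simulation counts used for `cycles_per_digit` (1000 divisions, 1400 correction cycles, a 53-bit mantissa) are made-up inputs chosen only to exercise the formula; the inputs to `delay_per_bit` are the cycles per digit and cycle delay reported for the radix-16 design in Section 4.

```python
import math

def cycles_per_digit(n_corrections, n_divisions, mantissa_bits, radix):
    """Average cycles per quotient digit, equation (20)."""
    digits_per_division = math.ceil(mantissa_bits / math.log2(radix))
    return 1 + n_corrections / (n_divisions * digits_per_division)

def delay_per_bit(c_d, cycle_delay, radix):
    """Delay per quotient bit, equation (21)."""
    return c_d * cycle_delay / math.log2(radix)

# Made-up counts: 14 digits per division, one correction every ten digits.
print(cycles_per_digit(1400, 1000, 53, 16))

# 1.1 cycles per digit at a cycle delay of 28.4 gives about 7.8 per bit.
print(delay_per_bit(1.1, 28.4, 16))
```

The second result rounds to the delay per bit of 8 listed for the radix-16 design.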
4. Implementation and Timing
We have designed the radix-16 unit implementing our algorithm using a standard-cell library (ES2-ECPD10) and SIS for logic minimization. The characteristics of both our design and those presented in [3] are reported in Table 1. Delays and area are expressed as multiples of the delay and area of a two-input NAND gate with a fanout of three NAND gates.

| Radix | 16 | 16 [3] | 512 [3] |
|---|---|---|---|
| a | 12 | 12 | 320 |
| # of CSAs | 2 | 2 | 4 |
| # bits of w' (sum, carry) | (6, 6) | (6, 5) | (12, 12) |
| # bits of d' | 2 | 1 | 6 |
| # bits of w'' (integer bits, e) | (2, 4) | (3, 5) | (6, 4) |
| # bits of d'' | 2 | 4 | 3 |
| cycles/digit | 1.1 | 1.3 | 1.8 |
| cycle delay | 28.4 | 28.8 | 43.6 |
| delay/bit | 8 | 9.4 | 8.8 |
| latency | 416 | 489 | 458 |
| area | 5215 | 4900 | 8400 |

Table 1. Characteristics of the implementations.

Figure 1 shows the implementation of the speculative radix-16 divider with $\rho = \frac{12}{15}$. The use of a redundant digit set makes it possible to perform carry-free additions using carry-save adders (CSA). Quotient conversion from redundant to non-redundant form may be performed on the fly, i.e. in parallel with the division operation, as also described in [4]. The dashed line indicates the critical path. In order to reduce the number of CSAs necessary to implement all the multiples of $d$, digits $q^s_{j+1} \in \{-11, +11\}$ are not speculated, since they would need three CSAs; consequently they are obtained by exploiting the correction function. Figure 1 also shows the cycle time of the divider. In order to speed up the circuit, speculation at step $j$ is overlapped with the error detection at step $j-1$. Each digit $q^s_{j+1}$ is the sum of two contributions computed by two different combinational blocks; namely:

$$q^s_{j+1} = q_h + q_l$$
where $q_h \in \{\pm 8, \pm 4, 0\}$ and $q_l \in \{\pm 4, \pm 2, \pm 1, 0\}$. Figure 2 shows in detail how the selection logic is implemented for radix 16. The residual estimation in carry-save form is first assimilated by a carry-propagate adder (CPA) whose outputs are fed into two different combinational blocks; the former computes $q_h$ and only needs the five most significant bits of the assimilation, the latter computes $q_l$ and needs the whole assimilation. As a consequence the selection logic has a variable delay. Computation of digit $q_h$ can be performed faster, due to the reduced number of inputs, in order to shorten the critical path. On the other hand, computing $q_l$ is a slower operation and is overlapped with the computation of $q_h$ and with part of the delay introduced by the first CSA in the chain (see Figure 2).

Figure 1. Implementation and Timing of the Radix-16 Speculative Divider.

What leads to improved performance with respect to the implementations without partial advance described in [3] is the different arrangement of the divisor multiples along the adder chain. The choice of the multiples is determined according to the frequency with which a digit is selected. For this reason, the choice of speculating digits $q^s_{j+1} \in \{-12, +12\}$, unlike [3] where such digits are generated by exploiting the correction logic, leads to hit ratios very close to 90%, while the architecture proposed in [3] has a hit ratio of 70%. In addition, although we use one more bit to speculate, we achieve cycle delays comparable to those of [3] by designing the selection logic as reported in Figure 2 (since the table that computes $q_h$ requires a number of bits smaller than the assimilation) and by using Boolean relations [5] to synthesize the tables. The correction function also contributes to speeding up the execution time of the architecture since, unlike [3], it always needs only a single cycle to correct an error.
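The statement that two carry-save adders cover every speculated digit except $\pm 11$ can be checked by enumerating the sums of the two partial digits. The sketch below takes the two partial digit sets from the text above, read with both signs; the variable names are hypothetical.

```python
# Partial digits handled by the two adders in the chain.
Q_H = (-8, -4, 0, 4, 8)
Q_L = (-4, -2, -1, 0, 1, 2, 4)
A = 12  # maximum digit magnitude of the radix-16 digit set

# Every digit that can be formed as one q_h plus one q_l.
reachable = {h + l for h in Q_H for l in Q_L}

# Digits of the set {-12, ..., +12} that two adders cannot produce.
missing = [q for q in range(-A, A + 1) if q not in reachable]
print(missing)  # [-11, 11]
```

Every other digit of the set, including $\pm 12 = \pm 8 \pm 4$, is reachable with one multiple per adder, so only $\pm 11$ has to be left to the correction function.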
5. Conclusions
Quotient digit speculation is an alternative to conventional division techniques. Assimilating a reduced number of bits of both residual and divisor may lead to a fast selection function and hence to a shorter execution time. Nevertheless, the speculated digit may be incorrect. The correctness of the selection is checked by an error detection function that verifies whether the speculated residual falls within the correct bounds. In the case of an incorrect speculation the algorithm rolls back and the digit is corrected. This operation is performed by an error correction function. Because of the possible rollbacks, the execution time is variable, i.e. different iterations may take different amounts of time. As a consequence, we have transformed a fixed-latency unit into a variable-latency one that runs with a faster clock cycle and whose operation resembles that of the telescopic units introduced in [1]. Moreover, due to its variable latency, our algorithm is also particularly appealing for asynchronous implementations.
Figure 2. Implementation and Timing of the Radix-16 Selection Function.
References
[1] L. Benini, E. Macii, and M. Poncino. Telescopic Units: Increasing the Average Throughput of Pipelined Designs by Adaptive Latency Control. In 34th Design Automation Conference, 1997.
[2] J. Cortadella and T. Lang. Division with Speculation of Quotient Digits. In 11th Symposium on Computer Arithmetic, pages 87–94, 1993.
[3] J. Cortadella and T. Lang. High-Radix Division and Square Root with Speculation. IEEE Transactions on Computers, C-43(8):919–931, August 1994.
[4] M. D. Ercegovac and T. Lang. Division and Square Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers, Norwell, MA, 1994.
[5] Y. Watanabe and R. E. Bryant. Heuristic Minimization of Multiple-Valued Relations. IEEE Transactions on Computer-Aided Design of Integrated Circuits, 12(10):1458–1472, October 1993.
