A new binary floating-point division algorithm and its software implementation on the ST231 processor by Jeannerod, Claude-Pierre et al.
A new binary floating-point division algorithm and its
software implementation on the ST231 processor
Claude-Pierre Jeannerod, Herve´ Knochel, Christophe Monat, Guillaume Revy,
Gilles Villard
To cite this version:
Claude-Pierre Jeannerod, Herve´ Knochel, Christophe Monat, Guillaume Revy, Gilles Villard.
A new binary floating-point division algorithm and its software implementation on the ST231
processor. LIP research report RR2008-39. 2009. <ensl-00335892v2>
HAL Id: ensl-00335892
https://hal-ens-lyon.archives-ouvertes.fr/ensl-00335892v2
Submitted on 18 Mar 2009
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destine´e au de´poˆt et a` la diffusion de documents
scientifiques de niveau recherche, publie´s ou non,
e´manant des e´tablissements d’enseignement et de
recherche franc¸ais ou e´trangers, des laboratoires
publics ou prive´s.
A new binary floating-point division algorithm and
its software implementation on the ST231 processor
Claude-Pierre Jeannerod1, 2 Herve´ Knochel4
Christophe Monat4 Guillaume Revy2, 1 Gilles Villard3, 2, 1
1INRIA, 2Universite´ de Lyon, 3CNRS, 4STMicroelectronics
Laboratoire LIP, E´cole normale supe´rieure de Lyon — 46, alle´e d’Italie, 69364 Lyon cedex 07, France
Compilation Expertise Center, STMicroelectronics — 12, rue Jules Horowitz BP217, 38019 Grenoble cedex, France
Abstract
This paper deals with the design and implementation of
low latency software for binary floating-point division with
correct rounding to nearest. The approach we present here
targets a VLIW integer processor of the ST200 family, and
is based on fast and accurate programs for evaluating some
particular bivariate polynomials. We start by giving ap-
proximation and evaluation error conditions that are suf-
ficient to ensure correct rounding. Then we describe the
heuristics used to generate such evaluation programs, as
well as those used to automatically validate their accuracy.
Finally, we propose, for the binary32 format, a complete
C implementation of the resulting division algorithm. With
the ST200 compiler and compared to previous implemen-
tations, the speed-up observed with our approach is by a
factor of almost 1.8.
Keywords: binary floating-point division, correct round-
ing, polynomial evaluation, code generation and validation,
VLIW integer processor.
1. Introduction
Although floating-point divisions are less frequent in ap-
plications than other basic arithmetic operations, reducing
their latency is often an issue [16]. Since low latency im-
plementations may typically be obtained by expressing and
exploiting instruction parallelism, intrinsically parallel al-
gorithms tend to be favored. Some examples are [18, 4] for
hardware, and [12, 19] for software.
In this paper we focus on designing and implementing
low latency floating-point division software for STMicro-
electronics’ ST231 processor. For this 32-bit VLIW integer
processor, various correctly-rounded binary floating-point
division algorithms have already been implemented [19].
There, the lowest latency measured is of 48 cycles; it was
obtained by exposing instruction-level parallelism (ILP) by
means of a suitable combination of univariate polynomial
evaluation and Goldschmidt’s method, in a way similar
to [18]. However, we will see in this paper that much more
ILP can in fact be exposed by relying exclusively on poly-
nomial evaluation, leading to a latency of 27 cycles and thus
a speed-up by a factor of almost 1.8. To get this result we
shall take three main steps, which we summarize now.
As a first step, we extend to division the polynomial-
based algorithm introduced in [8] for square rooting. This
extension provides a set of approximation and evaluation
error conditions that will be proven to be sufficient to ensure
correct rounding.
The second step consists in generating an efficient poly-
nomial evaluation program, and in the automatic validation
of its accuracy. On the one hand, the generation relies on
several heuristics aimed at maximizing ILP exposure. On
the other hand, checking that such a program is accurate
enough turns out to be more involved than for square root,
and will require a specific splitting strategy.
Third, besides the polynomial evaluation program, ad-
ditional subroutines and associated C code sequences are
proposed. Those include sign and exponent computation,
rounding, and handling overflow, underflow and special val-
ues. This provides a complete division code (for rounding
to nearest and without subnormal numbers), optimized for
the ST231 processor yet portable.
The paper is organized as follows. After some prelim-
inaries and notation in Section 2, the three contributions
above will be described in Sections 3, 4, 5, respectively.
Some experimental results obtained with the ST231 proces-
sor and the ST200 compiler will be reported in Section 6,
and we shall conclude in Section 7.
2. Preliminaries and notation
Floating-point data and rounding. The floating-point
data we shall consider are ±0, ±∞, quiet or signaling
Not-a-Numbers (qNaN, sNaN), as well as normal binary
floating-point numbers
x = (−1)sx ·mx · 2ex , (1)
with sx ∈ {0, 1}, mx = (1.mx,1 . . .mx,p−1)2 and ex ∈{emin, . . . , emax}. (In particular subnormal numbers [1, §3.3]
will not be considered.) The precision p and extremal expo-
nents emin and emax are assumed to be integers such that p ≥ 2
and emin = 1− emax.
The rounding attribute chosen here is “to nearest even”
(roundTiesToEven [1, §4.3.1]) and will be referred to as RN.
Except for some special input (x, y) for which special val-
ues must be delivered as detailed in Section 5.4, the standard
requires that RN(x/y) be returned whenever x and y are as
in (1).
Correctly rounded division. That RN(x/y) essentially
reduces to a correctly rounded ratio of (possibly scaled) sig-
nificands can be recalled as follows. First, using RN(−x) =
−RN(x) gives
RN(x/y) = (−1)sr · RN(|x/y|),
where sr is the XOR of sx and sy . Then, taking c = 1 if
mx ≥ my , and 0 otherwise, one has |x/y| = ` · 2d, where
` = 2mx/my · 2−c, d = ex − ey − 1 + c. (2)
Since bothmx andmy are in [1, 2), ` is in [1, 2) as well, and
` · 2d will be called the normalized representation of |x/y|.
Tighter enclosures of ` can in fact be given:
Property 1. If mx ≥ my then ` ∈ [1, 2 − 21−p] else ` ∈
(1, 2− 21−p).
Proof. If mx ≥ my then c = 1, and we deduce from 1 ≤
mx ≤ 2− 21−p and 0 < 1/mx ≤ 1/my ≤ 1 that 1 ≤ ` ≤
2 − 21−p. If mx < my then mx ≤ my − 21−p and thus
` ≤ 2−22−p/my . Hence, usingmx ≥ 1 and 1/my > 1/2,
we obtain 1 < ` < 2− 21−p as desired.
These enclosures of ` will be used explicitly when handling
underflow in Section 5.3. For now, we simply note that both
of them give RN(`) ∈ [1, 2− 21−p], so that
RN(|x/y|) = RN(`) · 2d.
If emin ≤ d ≤ emax then we shall return the normal number
RN(x/y) = (−1)sr ·mr · 2er , where
sr = sx ⊕ sy, mr = RN(`), er = d. (3)
Else, d is either smaller than emin or greater than emax. Since
subnormals are not supported, some special values will be
returned in either case, as detailed in Section 5.3. Thus,
to get RN(x/y) the main task consists in deducing from
mx and my the correctly rounded value RN(`) in (3), the
sign sr and exponent er being computable in parallel and at
lower cost (see Section 5.1).
Correct rounding from one-sided approximations.
Among the many methods known for getting RN(`) (see [5,
§8.6] and the references therein), the one we focus on in
this paper uses one-sided approximations: as shown in [5,
p. 459], this method reduces the computation of RN(`) to
that of an approximation v of ` such that
−2−p < `− v ≤ 0. (4)
Here v is representable with, say, k bits while ` has in most
cases an infinite binary expansion (1.`1 . . . `p−1`p . . .)2.
Once such a v is known, correct rounding follows easily
(see [5, p.460] and Section 5.2) and it remains to deduce
from each possible pair (mx,my) a value v that satisfies (4).
Before presenting in Sections 3 and 4 a novel approach
for computing v in a certified and efficient way, we give
next a brief description of the implementation context.
Software implementation on integer processors. Our
implementation of division is for the binary32 format of [1]:
k = 32, p = 24, emax = 127.
It will consist of a piece of C code using exclusively 32-bit
integers, for input/output encoding as well as for intermedi-
ate variables.
Concerning the input data x, y the standard encoding into
32-bit unsigned integers X,Y is assumed [1, §3.4]. For ex-
ample, the bit stringX31 . . . X0 ofX =
∑31
i=0Xi2
i is such
that X31 = sx (sign bit of x),
∑7
i=0Xi+232
i = ex + 127
(biased exponent of x, for x normal), and Xi = mx,23−i
for i = 0, . . . , 22 (fraction bits of x). The output is encoded
similarly and our division code eventually takes the form of
a C function like the one below:
unsigned int binary32div( unsigned int X ,
unsigned int Y ) {...}
Concerning the operations on input or intermediate vari-
ables, we assume available the basic arithmetic and logi-
cal operators +, &, etc. as well as a max instruction and
a multiplier mul defined for A,B ∈ {0, . . . , 232 − 1} as
mul(A,B) = bA · B · 2−32c. Here b·c denotes the usual
floor function, and mul thus gives the 32 most significant
bits of the exact product A ·B.
Therefore, up to emulating max and mul as shown in
Appendix A, all we need is a C compiler that implements
32-bit arithmetic. However some features of the ST231 pro-
cessor and compiler have influenced the design and opti-
mizations described in Sections 4 and 5. In particular, 4
instructions can be launched simultaneously, among which
at most two mul instructions. Also, the latency of max and
of the other basic operators is 1, while it is 3 for mul.
3. Sufficient conditions for correct rounding
This section gives sufficient conditions for (4) to hold.
To do so we extend to division the bivariate polynomial-
based framework introduced in [8] for computing correctly-
rounded floating-point square roots. A preliminary step is
to restrict (4) to the following more symmetric, but only
slightly stronger, condition∣∣(`+ 2−p−1)− v∣∣ < 2−p−1. (5)
Sufficient conditions to have (5) will be obtained by intro-
ducing a suitable bivariate polynomial approximant along
with approximation and evaluation error bounds. We will
rely on these conditions for designing our validation ap-
proach in Section 4.2, especially for providing hint values
to the software tools Sollya and Gappa that we use.
3.1 Polynomial approximation
We start by interpreting the rational number ` + 2−p−1
as the value of a suitable function of mx and my . Defining
F (s, t) = 2−p−1+ s/(1+ t), we have that `+2−p−1 is the
exact value of F at the particular point
(s∗, t∗) = (2mx · 2−c,my − 1). (6)
Whenmx andmy vary then s∗ and t∗ range in the domains
S = [1, 2−21−p]∪ [2, 4−23−p] and T = [0, 1−21−p].
(In particular, the second interval for S comes from the fact
that c = 0 impliesmx ≤ 2− 22−p.)
Then, following [8], the next step is to approximate F
over S × T by a suitable bivariate polynomial P . Since
F is linear with respect to s, one can reduce to univariate
approximation by taking
P (s, t) = 2−p−1 + s · a(t), (7)
with a(t) a polynomial approximating 1/(1 + t) over T . If
we define
α(a) = max
t∈T
|1/(1 + t)− a(t)| , (8)
then the approximation error for F at (s∗, t∗) satisfies
|F (s∗, t∗)− P (s∗, t∗)| ≤ (4− 23−p)α(a).
Since from (5) the overall error must be less than 2−p−1, we
take
α(a) < 2−p−1/
(
4− 23−p) (9)
as a target approximation error bound for computing the ap-
proximant a.
The approximant construction is done using the software
environment Sollya1, from the function 1/(1 + t), the do-
main T , and the error bound (9). For reducing the final
cost we choose a polynomial of smallest degree δ that sat-
isfies the above constraints. For p = 24, the Sollya function
guessdegree leads to δ = 10. Then, with the additional
constraint that the coefficients should have no more than
k = 32 fraction bits, we deduce a suitable polynomial a(t)
from the remez function and by truncation. One obtains
a(t) =
∑10
i=0 ait
i, with each |ai| defined by a 32-bit un-
signed integer Ai such that |ai| = Ai · 2−32. In our case,
it turns out that 1 > |a0| > · · · > |a10| and that the signs
alternate, so that
ai = (−1)i ·Ai · 2−32 ∈ (−1, 1), 0 ≤ i ≤ 10. (10)
Finally, a certified bound on the actual approximation error
is obtained using Sollya’s infnorm function:
α(a) ≤ 3 · 2−29 ≈ 2−27.41 < 2
−25
4− 2−21 ≈ 2
−27. (11)
3.2. Subdomain-based error conditions
Once a is available, for establishing (5) we need to con-
sider the rounding errors of the evaluation ofP given by (7).
In [8], for the design of a correctly-rounded square root
function in a similar context, errors have been bounded by
a single value, on the whole domain S × T . Here we ex-
tend this approach since the room left by (11) between the
actual approximation error and 2−p−1, is not sufficient for
taking into account rounding errors. (We refer to the exper-
imental data in Section 4.2.) The validation of our division
algorithm will require to see the interval T as a union of n
subintervals: T = ⋃ni=1 T (i) and, accordingly, the approx-
imation error α(a) of (8) will then split up into
α(i)(a) = max
t∈T (i)
|1/(1 + t)− a(t)| , 1 ≤ i ≤ n. (12)
The value v in (5) will result from the evaluation of P by
a finite precision program P that produces, for each subdo-
main S × T (i), the rounding error
ρ(i)(P) = max
(s,t)∈S×T (i)
|P (s, t)− P(s, t)| . (13)
Let i be such that (s∗, t∗) belongs to S × T (i). Then,
from (12) and (13), the overall error is eventually bounded
as
∣∣(`+ 2−p−1)− v∣∣ ≤ (4− 23−p)α(i)(a) + ρ(i)(P), and
we arrive at the following sufficient conditions for (5):
1http://sollya.gforge.inria.fr/ and [10].
Property 2. If the approximation and rounding errors sat-
isfy, for 1 ≤ i ≤ n,
(4− 23−p)α(i)(a) + ρ(i)(P) < 2−p−1 (14)
then (5) holds.
4. Evaluation code generation and validation
Here we present our strategy to produce automatically
an efficient evaluation program P along with an a posteriori
certificate on its accuracy (in the sense of Property 2).
For p = 24, the program P will implement a finite pre-
cision evaluation of P (s∗, t∗) = 2−25 + s∗ · a(t∗). Here a
is the polynomial obtained in Section 3.1 and whose coef-
ficients ai satisfy (10). The program will in fact store only
the 32-bit integers Ai, the coefficient signs being handled
through an appropriate choice of arithmetic operators (+ or
−) made when generating P . Concerning the evaluation
point (s∗, t∗) defined in (6), it can be encoded by a pair
(S, T ) of 32-bit unsigned integers such that
s∗ = S · 2−30, t∗ = T · 2−32. (15)
The computation of S and T from X and Y can be imple-
mented in C as follows:
Tx = X << 9; T = Y << 9; X8 = X << 8;
c = Tx >= T; Mx = X8 | 0x80000000;
S = Mx >> c;
The above code has been written with ILP exposure in
mind. For example, c is computed by comparing the trailing
significands mx − 1 and my − 1, rather than mx and my .
On ST231, this piece of code will typically take 3 cycles,
the delay between S and T being of 2 cycles.
Concerning the output of P , note first that by Property 1,
v in (4) must satisfy v ∈ [1, 2−2−p). Moreover, because of
the formats introduced in (10) and (15) and of the fact that
P is essentially a fixed-point code, the output of P will be
a 32-bit unsigned integer V representing v as
v = V · 2−30. (16)
(How to implement correct rounding using this format will
be detailed in Section 5.2.)
4.1. Evaluation program generation
Once the integers S, T,A0, A1, . . . , A10 are available,
we determine automatically an efficient programP for eval-
uating V by means of 32-bit additions/subtractions and mul
instructions. By efficient program we mean a way of paren-
thesizing the arithmetic expression P (s∗, t∗) that reduces
the execution latency as well as the number of mul instruc-
tions. This depends on a set of heuristics, still under devel-
opment, but that already enabled to produce an evaluation
program suitable for floating-point division on ST231.
Evaluation tree generation. Following [6], we generate
evaluation trees whose height is kept low through ILP expo-
sure, thus avoiding highly sequential schemes like Horner’s.
To do so, we proceed in two substeps. Given the polyno-
mial degree δ, the delay between S and T , and some la-
tencies for addition/subtraction and mul, we first heuristi-
cally compute a target latency τ forP , assuming unbounded
parallelism. This hint for the latency is a priori feasible,
still reasonably low. Then we generate automatically a set
of evaluation trees with height no more than this target la-
tency. An example of such a tree is given in Figure 1, whose
height corresponds to the target latency τ = 14. This la-
tency2 is more than three times lower than the latency of
(3+ 1)× 11 = 44 cycles that would result from computing
P (s∗, t∗) with Horner’s rule.
addition (1 cycle)
0x20
S
A0
T A1
r0
r1
r2
r3
S
T T
r4
r5
A2
T A3
r6
r7
r8
r9
r4 r5
r10
A4
T A5
r11
r12
r4
A6
T A7
r13
r14
r15
r16
r17
r18
r10
r4 r4
r19
A8
T A9
r20
r21
r4 A10
r22
r23
r24
r25
Vmultiplication (3 cycles)
Figure 1. Generated evaluation tree.
Our approach is heuristic in the sense that if no eval-
uation tree is found that satisfies the target latency τ , we
increase τ and restart the process. However, the practice
has shown that the number of evaluation trees found given
the target τ is usually extremely large, already for degrees
much lower than the degree δ = 10 we have here. Con-
sequently, we have implemented several filters in order to
reduce significantly this number during the generation, and
thus to speed up the whole process.
Arithmetic operator choice. A first filter consists in re-
stricting to evaluation trees for which all intermediate val-
ues are positive. This restriction allows to work with the
full precision k = 32, instead of loosing one bit because
of signed arithmetic. Such special trees can be found by
choosing the addition/subtraction operators appropriately,
for example considering a0 − (−a1t) rather than a0 + a1t,
2On the tree, it corresponds to computing r4, and then r22 up to V in
3 + 3 + 1 + 3 + 3 + 1 = 14 cycles.
for a1 is negative. If the sign of one of the intermediate
values computed by the tree changes when the input (S, T )
varies, then that evaluation tree is rejected. This first filter
is implemented using the MPFI3 library for interval arith-
metic. (An interval containing zero is interpreted as a sign
change for the corresponding variable.)
Scheduling verification. A second filter consists in
checking if a given evaluation tree can be scheduled without
latency increase on a simplified model of the real target ar-
chitecture. In our case of division on ST231, the evaluation
tree has a latency of 14 cycles in theory (that is, assuming
unbounded parallelism) as well as in practice. In fact, the
possible scheduling displayed in Table 1 uses only 3 out of
the 4 issues offered by the ST231. For now, this second fil-
ter is implemented using a naive list scheduling algorithm.
Issue 1 Issue 2 Issue 3 Issue 4
Cycle 0 r0 r4
Cycle 1 r6 r13
Cycle 2 r11 r20
Cycle 3 r1 r5 r22
Cycle 4 r2 r14 r19
Cycle 5 r12 r15 r21
Cycle 6 r7 r10 r23
Cycle 7 r3 r8 r24
Cycle 8 r16
Cycle 9 r17
Cycle 10 r9 r25
Cycle 11
Cycle 12 r18
Cycle 13 V
Table 1. Feasible scheduling on ST231.
Evaluation code. Finally, the first evaluation tree that
passes both filters is chosen, and the corresponding evalua-
tion program is printed out as a piece of C code. In our case,
the obtained evaluation program is presented in Listing 1.
unsigned int r0 = mul( T , 0xffffe7d7 );
unsigned int r1 = 0xffffffe8 - r0;
unsigned int r2 = mul( S , r1 );
unsigned int r3 = 0x00000020 + r2;
unsigned int r4 = mul( T , T );
unsigned int r5 = mul( S , r4 );
unsigned int r6 = mul( T , 0xffbad86f );
unsigned int r7 = 0xfffbece7 - r6;
unsigned int r8 = mul( r5 , r7 );
unsigned int r9 = r3 + r8;
unsigned int r10 = mul( r4 , r5 );
unsigned int r11 = mul( T , 0xf3672b51 );
unsigned int r12 = 0xfd9d3a3e - r11;
unsigned int r13 = mul( T , 0x9a3c4390 );
unsigned int r14 = 0xd4d2ce9b - r13;
unsigned int r15 = mul( r4 , r14 );
unsigned int r16 = r12 + r15;
unsigned int r17 = mul( r10 , r16 );
unsigned int r18 = r9 + r17;
unsigned int r19 = mul( r4 , r4 );
unsigned int r20 = mul( T , 0x1bba92b3 );
unsigned int r21 = 0x525a1a8b - r20;
unsigned int r22 = mul( r4 , 0x0452b1bf );
unsigned int r23 = r21 + r22;
3http://gforge.inria.fr/projects/mpfi/
unsigned int r24 = mul( r19 , r23 );
unsigned int r25 = mul( r10 , r24 );
unsigned int V = r18 + r25;
Listing 1. Generated evaluation program.
4.2. Evaluation program validation
We have validated the above evaluation code using
Gappa4. Gappa is a software tool intended to help verifying
and formally proving properties on numerical programs. It
manipulates logical formulas involving the inclusion of ex-
pressions in intervals, and allows to bound rounding errors
in a certified way. The first validation step is to check that no
overflow can occur. This is easily done with Gappa by veri-
fying that no intermediate value can be greater than 232−1.
For proving the correct rounding inequality (5), we use
Property 2 for providing hints to Gappa. With the polyno-
mial a computed in Section 3.1, and no subdivision of the
domain T (that is, n = 1), the approximation error satisfies
α(a) ≤ θ0 = 3 · 2−29 ≈ 2−27.41. (17)
According to (14), we then take η0 = 2−25−(4−2−21)·θ0,
and use the hint
ρ(P) < η0 ≈ 2−26.99 (18)
as a sufficient condition on the rounding error during the
evaluation of the program P . This approach succeeds in the
case mx ≥ my where, with the help of Gappa, we have
checked that (18) is true. Note that the strict inequality is
checked with Gappa through an inequality ρ(P) ≤ η0 − 
for a small enough, positive .
In the case mx < my , with s∗ ∈ [2, 4 − 2−21] and
t∗ ∈ [2−23, 1 − 2−23], we need to split up the domain T .
Indeed, we find points t∗ at which (18) is not satisfied. For
example, for t∗ = 0.97490441799163818359375, with t∗
in a small enough interval T (i) (see the last row of Table 2),
we can check that ρ(P) < ρ(i)(P) < η4 ≈ 2−26.77. Nev-
ertheless, considering the approximation error on the same
interval, we can establish (14) since, in compensation, it can
be proven that α(i)(a) ≤ θ4 ≈ 2−27.49 < α(a).
For validating the division program, and finding ade-
quate subintervals T (i)’s, we have implemented a splitting
of T by dichotomy. This process has split the definition do-
main up into n = 36127 subintervals. For each of them, us-
ing Sollya, we first bound the approximation error α(i)(a).
This gives α(i)(a) ≤ θ for some rational number θ. Then
the sufficient condition (14) is established by checking that
ρ(i)(P) < η = 2−25 − (4 − 2−21) · θ. The dichotomy is
illustrated by Table 2 where “no” in the last column means
that the interval in the second column has to be split up.
4http://lipforge.ens-lyon.fr/www/gappa/ and [14].
Depth Subintervals α(·)(a) ≤ ρ(·)(P) < Does (14) hold?
1 I1,1 = [2−23, 1− 2−23] θ1 ≈ 2−27.41 η1 ≈ 2−26.99 no
2
I2,1 = [2−23, 0.5− 2−23] θ2 ≈ 2−27.41 η2 ≈ 2−26.99 yes
I2,2 = [0.5, 1− 2−23] θ1 ≈ 2−27.41 η1 ≈ 2−26.99 no
· · ·
j
Ij,1 = [2−23, 0.5− 2−23] θ2 ≈ 2−27.41 η2 ≈ 2−26.99 yes
Ij,2 = [0.5, 0.75− 2−23] θ1 ≈ 2−27.41 η1 ≈ 2−26.99 yes
Ij,19309 = [0.921875, 0.92578113079071044921875] θ3 ≈ 2−27.44 η3 ≈ 2−26.90 yes
Ij,19533 = [0.97490406036376953125, 0.97490441799163818359375] θ4 ≈ 2−27.49 η4 ≈ 2−26.77 yes
Table 2. Splitting steps.
The bounds on α(·)(a) and ρ(·)(P) (where the · stands for
the index of the interval in the subdivision) that are discov-
ered with Sollya andGappa are given in the third and fourth
columns. For the exact values of the θj’s and ηl’s we refer
to Appendix B.
5. Implementation of a complete division code
So far we have presented an efficient way of deducing
from mx and my an integer V such that v = V · 2−30 =
(1.v1 . . . v30)2 satisfies (4). To obtain a complete code for
binary floating-point division it remains to compute the sign
and the exponent of the result, to deduce from v a correctly-
rounded significand, and to pack those three fields accord-
ing to the standard encoding of the binary32 format. If ei-
ther x or y is a special operand then a special value must be
returned, as prescribed in [1], which requires specific han-
dling. The following sections detail algorithms for each of
those tasks along with some C codes well-suited for a target
like the ST231 processor. (All the variables are unsigned
int, except for D which is int.)
From Section 5.1 to Section 5.3, both x and y are sup-
posed to be normal numbers, while in Section 5.4 at least
one of them is special, that is, ±0, ±∞, qNaN or sNaN.
5.1. Sign and exponent computation
The sign sr of the result r is trivially obtained by taking
the XOR of the sign bits of X and Y :
Sr = ( X ˆ Y ) & 0x80000000;
The result exponent er will be obtained by first comput-
ing the integer
D = d+ emax − 1. (19)
If emin ≤ d ≤ emax then r is normal and the bits of its bi-
ased exponent Er = D + 1 will be deduced by adding the
correctly-rounded significand toD · 2p−1 (see Section 5.2);
otherwise, since subnormals are not supported, r must take
particular values like ±0, ±2emin or ±∞, and we will see in
Section 5.3 how such situations can be detected directly by
inspecting the integer D.
Let us now compute D. Since x and y are normal, their
biased exponents are Ex = ex + emax and Ey = ey + emax.
Therefore, using (19) together with the definition of d in (2),
D = Ex − Ey + emax − 2 + c. (20)
When k = 32, p = 24, and emax = 127, an implementation
of (20) that exposes ILP is:
absX = X & 0x7FFFFFFF; absY = Y & 0x7FFFFFFF;
Ex = absX >> 23; Ey = absY >> 23;
int D = (Ex + 125) - (Ey - c);
Note that the biased exponent values Ex and Ey have
been obtained by first computing the absolute values of x
and y (by setting the sign bits to zero), and then shifting
right by p − 1 = 23 positions. Of course, there are other
ways of extracting these biased exponent values, such as,
for example for Ex,
Ex = ( X >> 23 ) & 0xFF;
or
Ex = ( X << 1 ) >> 24;
However, we tend to prefer the version involving absX
and absY, for these absolute values will be reused when
computing special values in Section 5.4.
5.2. Rounding and packing
Given v that approximates ` from above as in (4), the cor-
rectly rounded significandmr = RN(`) can be deduced es-
sentially as in [8] and [5, p. 459] (see also [20, 15] and [17,
§3.3]): defining
w = (1.v1 . . . vp−1vp)2
as the truncated value of v to p fraction bits, one has
mr =
{
w truncated to p−1 fraction bits if w≥`,
w+2−p truncated to p−1 fraction bits if w<`.
To check this, note first that by truncation to p fraction bits,
0 ≤ v − w < 2−p, which together with (4) leads to
|`− w| < 2−p. (21)
Then it remains to verify for vp = 0 and for vp = 1, that the
above definition of mr always gives RN(`), using in par-
ticular the classic fact that ` cannot be exactly halfway be-
tween two consecutive normal floating-point numbers (see
for example [11, Lemma 1] or [3, p. 229]).
To implement the above definition ofmr one has to com-
pute w and to decide whether w ≥ `. Since v = V · 22−k
and the bit string of V is 01v1 . . . vk−2, truncating v to p
fraction bits means zeroing out the last k − p− 2 bits of V .
For (k, p) = (32, 24) the line below sets to zero the last six
bits of the bit string 01v1 . . . v24v25v26v27v28v29v30 of V :
W = V & 0xFFFFFFC0;
The bit string ofW is thus 01v1 . . . v24000000 and w =
W · 2−30. To know if w ≥ ` or not, let us introduce the
integerMy such thatmy = My · 21−k. With k = 32 and S
as in (15), we have the following characterization:
Property 3. The condition w ≥ ` is true if and only if the
condition mul(W,My) >= (S >> 1) is true.
Proof. From w = W · 2−30, my = My · 2−31, and s∗ =
S · 2−30, we deduce that w ≥ ` = s∗/my is equivalent to
W ·My · 2−32 ≥ S/2. Now, by definition of mul as a floor
function, we have
W ·My ·2−32−1 < mul(W,My) ≤W ·My ·2−32. (22)
Using the two inequalities in (22) together with the fact
that mul(W,My) and S/2 are nonnegative integers repre-
sentable with 32 bits, one can check that w ≥ ` if and only
if mul(W,My) ≥ S/2. In C, the latter condition reads
mul(W,My) >= (S >> 1).
To deduce the value of mr = (1.mr,1 . . .mr,p−1)2 let
Mr be the integer such thatmr = Mr ·21−p. It then follows
from w = W · 22−k that
Mr =
{
bW · 2−(k−p−1)c if w ≥ `,
b(W + 2k−p−2) · 2−(k−p−1)c if w < `.
Thus, for (k, p) = (32, 24), Mr is obtained by shifting ei-
therW orW + 26 to the right by seven positions.
Let us now pack the bits ofmr with the bits of sr andEr.
Since in Section 5.1 we have in fact computedD = Er−1,
removing the leading 1 in mr = (1.mr,1 . . .mr,p−1)2 can
be avoided: the k-bit integer that encodes the result r is
sr · 2k−1 +D · 2p−1 +Mr. (23)
In particularD ∈ {0, . . . , 2emax−1} implies thatD ·2p−1+
Mr must be less than 2k and thus never propagates a carry
to the sign bit.
Since we expect the computation of sr andD to be much
cheaper than that of Mr, one may first take the bitwise OR
of sr ·2k−1 andD·2p−1, and then only addMr. For (k, p) =
(32, 24) and p = 24 this gives the following code sequence:
My = (Y << 8) | 0x80000000;
if( mul(W,My) >= (S >> 1) )
return (Sr | (D << 23)) + (W >> 7);
else
return (Sr | (D << 23)) + ((W + 0x40) >> 7);
5.3. Overflow and underflow detection
The value of the integer D = Er − 1 in (20) can be
used to detect overflow and underflow. Indeed, the normal-
ity assumption on x and y implies that both Ex and Ey lie
in the normal (biased) exponent range {1, . . . , 2emax}. Since
c is either 0 or 1, it follows from (20) that D ranges from
−emax−1 to 3emax−2. This range contains {0, . . . , 2emax−1},
for which the result is a normal number; it also contains two
extremal ranges, for which either overflow or underflow oc-
curs.
Let us start with overflow. If D ≥ 2emax then Er = er +
emax gives er ≥ emax + 1. Therefore |x/y| ≥ 2emax+1 and, by
definition of the rounding operator RN(·) in [1], we have
RN(x/y) = (−1)sr∞.
For emax = 127 this case can be implemented as:
if( D >= 0xFE ) return Sr | 0x7F800000;
Now for underflow. If D < 0 then er ≤ emin − 1.
Using ` < 2 then gives |x/y| < 2emin . Following [7]
we round |x/y| to nearest-even as if subnormals were sup-
ported. This rounded value can be either 2emin (the smallest
positive normal number) or a subnormal number. By defi-
nition of RN(·) the first case occurs if and only if
(1− 2−p) · 2emin ≤ |x/y| < 2emin , (24)
and we shall return the normal number (−1)sr2emin . In the
second case, since we do not support subnormals, we shall
return (−1)sr0.
Property 4. One has (24) if and only if ex − ey = −emax
andmx = 2− 21−p andmy = 1.
Proof. Assume that (24) holds. Then it follows from
|x/y| = ` · 2d with ` ∈ [1, 2) that d must be equal to
emin − 1 = −emax, and that ` ≥ 2− 21−p. Property 1 implies
further that ` = 2 − 21−p and mx ≥ my . Using the defi-
nition of ` in (2) then givesmx/my = 2− 21−p and, since
1 ≤ mx,my ≤ 2 − 21−p, we conclude that the only possi-
ble value for (mx,my) is (2− 21−p, 1). Applying (2) with
(c, d) = (1,−emax) gives furthermore ex−ey = −emax. Con-
versely, one may check that if (mx,my) = (2 − 21−p, 1)
and ex − ey = −emax then |x/y| = (1− 2−p) · 2emin .
Using (20) with c = 1 shows that the condition ex −
ey = −emax is equivalent to D = −1. With Mx and My
computed as in Sections 4 and 5.2, Property 4 thus suggests
the following code for handling underflow:
if( D < 0 ) {
if( D == -1 && Mx == 0xFFFFFF00
&& My == 0x80000000 )
return Sr | 0x00800000;
else
return Sr;
}
5.4. Computing special values
Assume now that (x, y) is a special input, that is, x or
y is ±0, ±∞, qNaN, or sNaN. For each possible case the
standard [1] requires that a special value be returned. These
special values follow from those in Table 3 by adjoining the
correct sign, using x/y = (−1)sr |x|/|y|. (The standard
does not specify the sign of a NaN result [1, §6.3].)
|x|/|y| |y|
+0 normal +∞ NaN
|x|
+0 qNaN +0 +0 qNaN
normal +∞ RN(|x|/|y|) +0 qNaN
+∞ +∞ +∞ qNaN qNaN
NaN qNaN qNaN qNaN qNaN
Table 3. Special values for |x|/|y|.
The standard binary encoding [1, §3.4] implies in par-
ticular Table 4. This table allows to reuse absX and absY,
Value or range of integerX Floating-point datum x
0 +0
[2p−1, 2k−1 − 2p−1) positive normal number
2k−1 − 2p−1 +∞
(2k−1 − 2p−1, 2k−1 − 2p−2) sNaN
[2k−1 − 2p−2, 2k−1) qNaN
Table 4. Floating-point data encoded by X.
which areX and Y modulo 2k−1 (and are computed in Sec-
tion 5.1 for k = 32), to filter out all special input (x, y).
Indeed, x or y is in {±0,±∞, qNaN, sNaN} if and only if
absX or absY is in {0} ∪ {2k−1 − 2p−1, . . . , 2k−1 − 1},
which is equivalent to
max(absXm1,absYm1) ≥ 2k−1 − 2p−1 − 1, (25)
with absXm1 = (absX − 1) mod 2k and absYm1 =
(absY− 1) mod 2k.
Assume now that (x, y) has been filtered out by (25). Ta-
ble 3 then requires that a qNaN be returned if absX equals
absY, or ifmax(absX,absY) > 2k−1−2p−1.Otherwise,
if x is finite (that is, absX < 2k−1−2p−1) and y is nonzero
(that is, absY is nonzero) then ±0 is returned, else ±∞ is
returned. Hence the code below when (k, p) = (32, 24):
absXm1 = absX - 1; absYm1 = absY - 1;
if( max(absXm1,absYm1) >= 0x7F7FFFFF ) {
Inf = Sr | 0x7F800000; Max = max(absX,absY);
if( absX == absY || Max > 0x7F800000 )
return Inf | 0x00400000 | Max; // qNaN
if( (absX < 0x7F800000) && absY ) return Sr;
return Inf;
}
Here absXm1 and absYm1 are unsigned int, so
that addition is done modulo 232. Note also that thanks to
the use of the Max variable the payload of the qNaN result
is one the input payloads, as recommended in [1, §6.2.3].
6. Experimental results
We report here some experiments done with the com-
plete division code that follows immediately from the C
codes of Sections 4 and 5.
Validation of the complete division code. Although our
code has not been tested exhaustively, it has been validated
using two classical test programs for IEEE binary floating-
point arithmetic, and especially for division: the Extremal
Rounding Tests Set [13] and the TestFloat package [7]. This
was done by compiling the code with Gcc under Linux and
by using the software implementations of max and mul
given in Appendix A.
Results obtained with the ST200 VLIW compiler. The
same code has then been compiled with the ST200 VLIW
compiler (version 2-0-0-A 20061215) based on the Open64
compiler technology, re-targeted to the ST200 family.
This compiler can produce an annotated assembly out-
put, with specific scheduling annotations giving the date at
which each bundle (a group of up to 4 instructions) is sched-
uled. In addition, one peculiarity of the ST200 architecture
is that it provides a partial predication support in the form of
conditional selection instructions, that are used by the com-
piler [2] to generate branch-free code for all the subroutines
described in this article.
Using optimization level -O3, the assembly produced in
our case indeed consists of fully if-converted straight-line
code in which the number of instructions as well as the la-
tency is the same regardless of the nature of the input (nor-
mal numbers or special values). The table below collects
these two values together with the number of instructions
per cycle (IPC) and the code size:
Number of inst. Latency IPC Code size
87 27 cycles 87/27 ≈ 3.22 424 bytes
Since the ST231 has four issues, the IPC value indicates
clearly the parallel nature of our approach.
Also, since one cycle is necessary for the last two in-
structions to select and return the result, the other 85 in-
structions must have been scheduled on the 4 issues during
the first 26 cycles. This value of 26 cycles is close to the
ideal value of 22 = d85/4e.
Finally, in the same context and for the same problem
(that is, software implementation of binary32 floating-point
division, correctly-rounded to nearest even and without sub-
normal numbers), the previously fastest implementation had
a latency of 48 cycles [19, p. 104]. Thus, the speed-up
brought by our approach is by a factor of 1.78.
7. Conclusion and future work
In this paper, we have focused on the binary32 format
and on a particular 32-bit architecture. However, the ap-
proach we have presented in Sections 4 and 5 for generat-
ing/validating polynomial evaluation codes and for design-
ing complete division codes, should in principle be usable
for other values of the key parameters k, p, and emax. This
could be of interest when optimizing division codes for
other processors and/or other floating-point formats (like
binary64, binary128, or smaller formats used in computer
graphics).
This work also suggests two research directions, which
we are currently investigating. First, gradual underflow is
important [9] and our division code should definitely be ex-
tended in order to support subnormal numbers while keep-
ing the latency as low as possible; similarly, all the rounding
attributes prescribed in [1] should be implemented. Sec-
ond, in some important special cases such as inversion (in-
stead of division) or faithfully rounded results (instead of
correctly rounded results), further optimizations should be
easy to implement and validate by using the generation/val-
idation tools and the design method that we have presented
here.
Acknowledgements
This work was supported by “Poˆle de compe´titivite´ mon-
dial” Minalogic and by the ANR project EVA-Flo.
References
[1] IEEE standard for floating-point arithmetic. IEEE Std. 754-
2008, pages 1–58, Aug. 2008.
[2] C. Bruel. If-conversion SSA framework for partially predi-
cated VLIW architectures. In Digest of the 4th Workshop on
Optimizations for DSP and Embedded Systems, 2006.
[3] M. Cornea, J. Harrison, and P. T. P. Tang. Scientific Com-
puting on Itanium R©-based Systems. Intel Press, 2002.
[4] J.-A. P. D. Piso and J. D. Bruguera. Analysis of the impact
of different methods for division/square root computation in
the performance of a superscalar microprocessor. Journal of
Systems Architecture, 49(12-15):543–555, 2003.
[5] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan
Kaufmann, 2004.
[6] J. Harrison, T. Kubaska, S. Story, and P. T. P. Tang. The
computation of transcendental functions on the IA-64 archi-
tecture. Intel Technology Journal, 1999.
[7] J. Hauser. The SoftFloat and TestFloat Packages. Available
at http://www.jhauser.us/arithmetic/.
[8] C.-P. Jeannerod, H. Knochel, C. Monat, and G. Revy. Com-
puting floating-point square roots via bivariate polynomial
evaluation. Technical Report RR2008-38, LIP, Oct. 2008.
[9] W. Kahan. A brief tutorial on gradual underflow. Avail-
able at http://www.cs.berkeley.edu/˜wkahan/
ARITH_17U.pdf, July 2005.
[10] C. Q. Lauter. Arrondi correct de fonctions mathe´matiques
- fonctions univarie´es et bivarie´es, certification et automati-
sation. PhD thesis, E´NS Lyon, France, 2008.
[11] P. W. Markstein. Computation of elementary functions on
the IBM RISC System/6000 processor. IBM Journal of Re-
search and Development, 34(1):111–119, 1990.
[12] P. W. Markstein. Software division and square root using
Goldschmidt’s algorithms. In Proc. of the 6th Conference
on Real Numbers and Computers, pages 146–157, 2004.
[13] D. W. Matula and L. D. McFearin. Extremal rounding test
sets. Available at http://engr.smu.edu/˜matula/
extremal.html.
[14] G. Melquiond. De l’arithme´tique d’intervalles a` la certifica-
tion de programmes. PhD thesis, E´NS Lyon, France, 2006.
[15] S. F. Oberman and M. J. Flynn. Fast IEEE rounding for divi-
sion by functional iteration. Technical Report CSL-TR-96-
700, Computer Systems Laboratory, Dept. on Electrical En-
gineering and Computer Science, Stanford University, 1996.
[16] S. F. Oberman and M. J. Flynn. Design issues in division
and other floating-point operations. IEEE Trans. Comp.,
46:154–161, 1997.
[17] S. F. Oberman and M. J. Flynn. Division algorithms and im-
plementations. IEEE Trans. Comp., 46(8):833–854, 1997.
[18] J.-A. Pin˜eiro and J. D. Bruguera. High-speed double-
precision computation of reciprocal, division, square root
and inverse square root. IEEE Trans. Comp., 51(12):1377–
1388, 2002.
[19] S.-K. Raina. FLIP: a Floating-point Library for Integer Pro-
cessors. PhD thesis, E´NS Lyon, France, 2006.
[20] E. M. Schwarz. Rounding for quadratically converging al-
gorithms for division and square root. In Proc. of the 29th
Asilomar Conference on Signals, Systems and Computers,
pages 600–603. IEEE Computer Society, 1995.
A. Implementing the max and mul instructions
A C99 implementation of max and mul is as follows:
static inline
unsigned int max(unsigned int A, unsigned int B)
{ return A > B ? A : B; }
static inline
unsigned int mul(unsigned int A, unsigned int B)
{
unsigned long long int t0 = A;
unsigned long long int t1 = B;
unsigned long long int t2 = ( t0 * t1 ) >> 32;
return t2;
}
The ST200 compiler is able to generate a single instruc-
tion when compiling the max or the mul function: it gen-
erates respectively the maxu and mul64hu ST200 instruc-
tions. This is interesting, since there is no need to write
code using specific instrinsic functions, while reaching best
efficiency.
B. Exact values of error bounds
Approximation error / Rounding error bounds
θ1 =
32666224213410587279617460750040978302345
2162
η1 =
91351292232540183223011219609051754040083752329
2183
θ2 =
32666103655948062771727762437637439466801
2162
η2 =
91352303541714218547566298100368266740699999537
2183
θ3 =
15949042999768214370553488520665149430143
2161
η3 =
48897450915206223329135016410664060291247987071
2182
θ4 =
123256080210706428762854279532157659493459
2164
η4 =
427554820082809494938604083452868552125485921363
2185
C. Useful hexadecimal constants
For convenience’ sake the table below displays the dec-
imal value and/or the bit string of each of the hexadecimal
constants appearing in some code sequences of Section 4
and Section 5.
0x80000000 231 = (1 00 . . . 00| {z }
31 zeros
)2
0x7FFFFFFF (0 11 . . . 11| {z }
31 ones
)2
0xFF (00 . . . 00| {z }
24 zeros
11111111| {z }
8 ones
)2
0xFFFFFFC0 (11 . . . 11| {z }
26 ones
000000| {z }
6 zeros
)2
0x40 26
0x7F800000 231 − 223 = (0 11111111| {z }
8 ones
00 . . . 00| {z }
23 zeros
)2
0xFFFFFF00 232 − 28 = (11 . . . 11| {z }
24 ones
00000000| {z }
8 zeros
)2
0x7F7FFFFF 231 − 223 − 1
0x00400000 (00 . . . 00| {z }
9 zeros
1 00 . . . 00| {z }
22 zeros
)2
