Optimizing correctly-rounded reciprocal square roots for embedded VLIW cores by Jeannerod, Claude-Pierre & Revy, Guillaume
Optimizing correctly-rounded reciprocal square roots for
embedded VLIW cores
Claude-Pierre Jeannerod, Guillaume Revy
To cite this version:
Claude-Pierre Jeannerod, Guillaume Revy. Optimizing correctly-rounded reciprocal square
roots for embedded VLIW cores. Asilomar Conference on Signals, Systems, and Computers,
Nov 2009, United States. IEEE Signal Processing Society, 2009. <ensl-00391185v2>
HAL Id: ensl-00391185
https://hal-ens-lyon.archives-ouvertes.fr/ensl-00391185v2
Submitted on 25 Nov 2009
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destine´e au de´poˆt et a` la diffusion de documents
scientifiques de niveau recherche, publie´s ou non,
e´manant des e´tablissements d’enseignement et de
recherche franc¸ais ou e´trangers, des laboratoires
publics ou prive´s.
Optimizing correctly-rounded reciprocal square roots for embedded VLIW cores
Claude-Pierre Jeannerod1, 2 Guillaume Revy1, 2
1INRIA Grenoble - Rhoˆne-Alpes (Are´naire project-team),
2Laboratoire LIP (Universite´ de Lyon - UMR CNRS/E´NSL/INRIA/UCBL 5668)
email: {Claude-Pierre.Jeannerod,Guillaume.Revy}@ens-lyon.fr
E´cole normale supe´rieure de Lyon — 46, alle´e d’Italie, 69364 Lyon cedex 07, France
Abstract
This paper presents an optimized software implementa-
tion of the reciprocal square root function x 7→ x−1/2, for
IEEE binary32 floating-point data and with correct round-
ing to nearest. The main feature of this implementation is
high instruction-level parallelism (ILP) exposure, which re-
sults here from an extension of the bivariate polynomial
evaluation-based method of [6] as well as from the de-
sign of a specific rounding procedure. This implementation
proves to be very efficient for some VLIW processor cores
like STMicroelectronics’ ST231 (used mainly for embedded
media processing), for which a low latency of 29 cycles has
been measured.
Keywords: reciprocal square root, binary floating-point
arithmetic, correct rounding (to nearest), polynomial eval-
uation, software implementation, VLIW processor core.
1. Introduction
Reciprocal square roots frequently appear in digital sig-
nal processing and scientific computing [10], and correctly-
rounded implementations are recommended in the latest re-
vision of the IEEE 754 standard [1, §9.2]. Our aim here is
to present such an implementation, in software, for binary32
data and rounding to nearest even.
The targeted processors are the ST231 four-issue VLIW,
32-bit cores from STMicroelectronics, whose main features
are as follows: 4 parallel ALUs, 2 parallel multipliers (giv-
ing the first or last 32 bits of a 32 × 32 integer product),
a leading-zero counter, 64 general purpose registers and
8 condition registers, partial predication via select instruc-
tions, and encoding of immediate operands up to 32 bits.
In order to fully exploit the high degree of parallelism of
our target and to avoid using coefficient tables, we extend
to reciprocal square roots the high-ILP, polynomial-based
square rooting method introduced in [6]. This extension,
which is presented in Section 2, seems to allow for more
ILP exposure than the Newton-like iterations used for ex-
ample in [10, 7, 8]. Section 3 then gives some details about
our implementation of this extension for the binary32 for-
mat, its validation, and the performances obtained on the
ST231 core. In particular, a latency of 29 cycles has been
measured, for rounding to nearest even and with subnormal
number support.
A typical use of correctly-rounded reciprocal square
roots is for 3D vector normalization [x, y, z] 7→
[x/w, y/w, z/w], with w =
√
x2 + y2 + z2. In the con-
text of the FLIP library,1 our implementation allows to re-
place one square root (23 cycles) and three divisions (3×32
cycles) by one reciprocal square root (29 cycles) and three
products (3×21 cycles). Both cases yield an error of at most
1 ulp but, assuming the sum of squares is available, the lat-
ter reduces latency by 9% if the coordinates x/w, y/w, z/w
are produced simultaneously, and by over 20% if they are
produced sequentially.
2. Reciprocal square root algorithm
2.1. Special operands
For reciprocal square root, special operands are ±0,
±∞, negative numbers, and NaNs. The corresponding spe-
cial results required by the IEEE 754 standard [1] are shown
in Table 1. In this table “< 0” represents −∞ or any neg-
ative (nonzero) finite floating-point number, and “qNaN”
means any quiet Not-a-Number having the same payload as
the operand (that is, only the sign is not specified).
1http://flip.gforge.inria.fr/
Table 1. Special values of function rsqrt.
x +0 +∞ −0 < 0 NaN
rsqrt(x) +∞ +0 −∞ qNaN qNaN
Since here special operands are exactly the same as those
of square root, we apply the method described in [6, § 4.1]
to detect all of them simultaneously. Then, classically, the
special results displayed in Table 1 are computed in parallel
with the generic case (described next in Section 2.2), which
dominates the cost.
In some implementations the reciprocal square root of
−0 is +∞ instead of −∞. This is for example the case in
GNU MPFR.2 An advantage of such a choice is consistency
with the specification of the rootn function (x, n) 7→
x1/n, for which the standard mandates that the value +∞
be returned when x is −0 and n is even and negative.
2.2. Positive finite operands
When x is non special, it has the form x = m · 2e,
where, for a binary floating-point system of precision p and
extremal exponents emin and emax,
m = (m0.m1 . . .mp−1)2 and emin ≤ e ≤ emax.
More precisely, x is either such that m0 = 1 (normal num-
ber) or such that m0 = 0 and e = emin (subnormal num-
ber). We assume as usual that emin is even, emin ≤ emax,
emax = 1 − emin, and 2 ≤ p ≤ emax, all these constraints
being satisfied by the standard binary formats of [1].
In both subnormal and normal cases one may check that,
for example, 2emin ≤ x−1/2 ≤ 2emax . This implies that the
reciprocal square root of a non special number always lies
in the range [2emin , (2−21−p)·2emax ] of positive normal num-
bers or, in other words, that neither underflow nor overflow
can occur.
It follows that x−1/2 = ` · 2d for some real ` in [1, 2]
and some integer d such that emin ≤ d < emax. The correctly-
rounded (to nearest even) value of x−1/2 is thus given by
RN(x−1/2) = RN(`) · 2d,
and, classically, RN(`) and d can be computed in parallel.
First, we provide explicit formulae for ` and d. Let λ be
the number of leading zeros of m, and let m′ = m · 2λ and
e′ = e − λ. Let c be 1 if e′ is even, 0 otherwise. Then one
may check that
` = s
√
2/(1 + t) and d = b−(e′ + 1)/2c, (1)
where s = 2c/2 and t = m′ − 1, and with b·c denoting the
usual floor function.
2http://www.mpfr.org/mpfr-current/; see also [4, 11].
Second, the above formula for d being already suitable
for implementation, we focus on the computation of RN(`).
As in [6] we proceed by correcting “one-sided truncated ap-
proximations” similar to those of [3, p. 459]: ` is approxi-
mated from above by v to precision p. Then v is truncated
after p fraction bits into a number u. Finally RN(`) is pro-
duced by the following rounding algorithm:
if u ≥ ` then
return v truncated after p− 1 fraction bits
else
return v + 2−p truncated after p− 1 fraction bits
This rounding algorithm is not specific to reciprocal square
root and we initially used it in [6] for square root. The cor-
rectness proof given in [6, §3.2.1] only uses the fact below.
Fact 1. ` cannot be exactly halfway between two consecu-
tive floating-point numbers.
In the case of reciprocal square root, this fact is well
known to be true (see for example [5, Table 1]) and can
be shown as follows.
Proof. Assume ` is halfway between two consecutive
floating-point numbers. Then ` 6= 2 and, since ` ∈ [1, 2], its
binary expansion must have the form 1.`1 . . . `p−11. Con-
sequently, ` = (2L+ 1) · 2−p for some integer L such that
2p−1 ≤ L < 2p. Similarly, m′ = K · 21−p for some integer
K such that 2p−1 ≤ K < 2p. The first identity in (1) thus
gives K(2L + 1)2 = 23p+c, whose left-hand side has an
odd factor and whose right-hand side is an integer power of
two. Hence a contradiction and the proof follows.
The main difficulties of the above approach are to com-
pute v as fast as possible, and to evaluate the condition
u ≥ ` in order to decide whether u should be corrected
or not:
• To maximize ILP exposure, we compute v by the
method of [6], that is, as the value (up to rounding er-
rors) of a bivariate polynomial
P (s, t) = 2−p−1 + s · a(t) (2)
such that a(t) is a “good enough” polynomial approx-
imation of
√
2/(1 + t) over [0, 1).
• To decide whether u ≥ ` or not, we evaluate the equiv-
alent condition
(1 + t)u2 ≥ 2s2, (3)
whose both sides now have finite binary expansions.
3. Implementation for the binary32 format
The above algorithm has been implemented in C99 for
the binary32 format of [1], where p = 24 and emin = −126.
The lines of code for handling special operands, comput-
ing d, and evaluating the rounding condition (3) have been
written and optimized by hand. However, the polynomial
a(t) =
∑
i ait
i has been computed as a truncated Remez
approximant using Sollya,3 and a parallel and accurate eval-
uation code for v has been automatically written by a gener-
ator under development called CGPE (Code Generation for
Polynomial Evaluation [9]).
For each of these subtasks the goal is to achieve low la-
tency by exposing as much ILP as possible. Sections 3.1
and 3.2 illustrate this for, respectively, special-operand han-
dling and polynomial evaluation. Section 3.3 further details
our optimized implementation of the evaluation of condi-
tion (3). Section 3.4 concludes with the validation of the
resulting code and with performance results obtained on the
ST231 core.
3.1. Implementation of two specifications
for special operands
Our C implementation of the first specification (Table 1)
is given below. Here and hereafter X is the 32-bit unsigned
integer that carries the standard encoding of the binary32
floating-point operand x. (Also, unless otherwise specified
the variables of our C codes are always of type uint32 t.)
1 if ((X + 0xFFFFFFFF) >= 0x7F7FFFFF)
2 {
3 if (X <= 0x7F800000 || X == 0x80000000)
4 return X ˆ 0x7F800000; // +-inf or +0
5 return X | 0x7FC00000; // qNaN having the
6 // same payload as X
7 }
Line 1 allows to filter out all special inputs, while line 3 fur-
ther isolates inputs of the form ±0 or +∞; these two con-
ditions are exactly the same as for square root in [6, § 4.1].
The only difference is then the bitwise XOR of X and the
hexadecimal constant 0x7F800000, giving an overhead
of only one instruction (involving an extended immediate).
To implement the second specification of Section 2.1, it
suffices to replace line 4 in the code above with:
4 return ˜X & 0x7F800000; // +inf or +0
The bitwise negation of X allows in particular to exchange
the exponent fields of zero and infinity. Masking then sets
both the sign bit and the fraction field to zero.
3http://sollya.gforge.inria.fr/
On ST231, both specifications can be implemented at ex-
actly the same cost using the codes above. This comes from
the availability of an instruction andc that complements
the input and then applies a bitwise AND, and which has
the same latency (1 cycle) as the bitwise XOR instruction.
3.2. Generating an optimized code for poly-
nomial evaluation
The polynomial a(t) used has degree 9, and CGPE found
the following scheme for evaluating P (s, t) in (2):
r23
r22
r11 r21
r0
r1
r2
r3
T a1
2−25
S
a0
r10 r15
r9
r8r7
r6 r4
T T T
S
r5
r13
r12
T T
r14
r20
r19
r18 r17
r16a9 a7
a6
a5
a4
a3
a2
T a8
multiplcation (3 cycles)
addition (1 cycle)
In this scheme, all the variables are 32-bit unsigned in-
tegers, and addition and multiplication respectively mean
(A ± B) mod 232 and mul(A,B) := bAB/232c for two
such integers A and B. On the ST231, an addition takes
1 cycle and a multiplication takes 3 cycles. Furthermore, it
turns out that the above parenthesization can be evaluated in
14 cycles compared to 38 cycles using Horner’s rule. (One
of the ai’s is a power of two, which allows to trade off one
multiplication against one shift, and eventually to save two
cycles. Hence 38 cycles instead of 40 cycles.) The accu-
racy (rounding errors) of this evaluation scheme has been
checked using interval arithmetic from Gappa.4
3.3. Implementing the condition for correct
rounding
Assume that t and u are exactly represented by the k-bit
unsigned integers T = t ·2k and U = u ·2k−2, respectively.
To represent 1 + t exactly, we define a third k-bit unsigned
integer Y = (1+ t) ·2k−1. It follows that Y = 2k−1+T/2,
and condition (3) thus reads
Y U2 ≥ Z, Z = 23k−4+c, c ∈ {0, 1}.
Note that Y U2 is a 3k-bit positive integer that can be seen as
aQ5.(3k−5) number. However, since Z is a large power of
4http://gappa.gforge.inria.fr/ and [2].
two, evaluating the condition Y U2 ≥ Z does not require the
knowledge of the full product Y U2. Indeed, the property
below shows that it suffices to evaluate P ≥ Q, where P
and Q are two k-bit unsigned integers.
Property 1. Let P = bY U2 · 2−2kc and Q = 2k−4+c. One
has Y U2 ≥ Z if and only if P ≥ Q.
Proof. One has Z = Q · 22k. From P ≤ Y U2 · 2−2k it
follows that P ≥ Q implies Y U2 ≥ Z. Conversely, P < Q
implies P ≤ Q − 1 and, using Y U2 · 2−2k − 1 < P , it
follows that Y U2 · 2−2k < Q, that is, Y U2 < Z.
It remains to evaluate condition “P ≥ Q” as efficiently
as possible in our context. If k = 32 then Q equals 228+c
and can be implemented simply by shifting left by c:
Q = 0x10000000 << c;
To implement the computation of P = bY U2 · 2−2kc,
we may proceed as follows. Let Ahi and Alo be the two
k-bit, positive integers such that U2 = Ahi ·2k+Alo. Then
Y U2 = Y Ahi · 2k+Y Alo and, defining Bhi, Blo, Chi, Clo
as the k-bit positive integers such that Y Ahi = Chi·2k+Clo
and Y Alo = Bhi · 2k +Blo, we obtain
Y U2 = Chi · 22k + (Bhi + Clo) · 2k +Blo.
Consequently,
P = Chi +
(
carry generated by Bhi + Clo
)
. (4)
(The carry is 1 if addition overflows, 0 otherwise.) This is
illustrated by Figure 1, where Dhi = P and Dlo = Blo.
Y
× Ahi Alo
Bhi Blo
+ Chi Clo
Dhi Dmi Dlo
Figure 1. Decomposing the exact value of the
product Y U2 into three chunks of k bits each.
In particular, the lower half Blo of the product AloY is not
needed, nor is the middle term Dmi. All we must know is
whether adding Bhi to Clo overflows or not. The property
below shows how to detect this.
Property 2. The carry generated by the sum Bhi + Clo is
equal to 1 if and only if (Bhi + Clo) mod 2k < Bhi.
Proof. Recall that Bhi, Clo ∈ {0, 1, . . . , 2k − 1}, and let Q
and R be the unique non-negative integers such that
Bhi + Clo = Q · 2k +R and 0 ≤ R < 2k. (5)
Since Bhi + Clo < 2k+1, Q is either 0 or 1. Hence the
sum overflows if and only if Q = 1. If Q = 1 then
R = Bhi + Clo − 2k and, since Clo < 2k, we obtain
R < Bhi. Conversely, let us show that R < Bhi implies
Q = 1. Form (5), R < Bhi implies Clo < Q · 2k. Since
Clo ≥ 0 and Q ∈ {0, 1}, we deduce that Q = 1.
Combining (4) and Property 2 leads to the following im-
plementation for computing P .
1 A_hi = mul(U, U); A_lo = U * U;
2
3
4 B_hi = mul(Y, A_lo); C_lo = Y * A_hi;
5 C_hi = mul(Y, A_hi);
6
7 S = B_hi + C_lo;
8 carry = S < B_hi;
9 P = C_hi + carry;
Here, all variables are 32-bit unsigned integers. Further-
more, we assume available + and mul defined in Sec-
tion 3.2, and as well as * such that A * B := AB mod 232
for two 32-bit unsigned integers A and B. On ST231 cores,
< has the same latency as + (1 cycle) and * has the same
latency as mul (3 cycles). Therefore, the above code yields
P in 8 instructions and 9 cycles.
To sum up, we have shown so far that “P ≥ Q” can be
evaluated in 10 cycles. Since in (4) getting the carry is 1-
cycle more expensive than getting Chi (and, of course, than
getting Q), we may try to evaluate instead the equivalent
condition “carry≥ Q− Chi.” Unfortunately, how to im-
plement this comparison in fewer than 10 cycles is unclear.
However, on ST231 one can save 1 cycle by using the op-
eration addcg which performs integer addition with carry
propagation. More precisely, addcg takes two 32-bit un-
signed integersA,B and one bit cin (carry in) as inputs, and
produces as outputs the 32-bit unsigned integer C and the
bit cout (carry out) such that
A+B + cin = cout · 232 + C.
From Figure 1, we can thus obtain P = Dhi as follows:
• (Dmi, cout) = addcg(Bhi, Clo, 0),
• (P, ∗) = addcg(0, Chi, cout).
Notice that the carry ∗ produced by the second call to the
addcg instruction is in fact known to be zero.
A first way of implementing this computation of P is to
use the intrinsic function st200addcg, whose operands
are three 32-bit unsigned integersA,B, cin, and whose out-
put is the 64-bit unsigned integer cout · 232+C. Hence, the
lines 7, 8, 9 of the previous code can be replaced with:
7 uint64_t S = __st200addcg(B_hi, C_lo, 0);
8 P = (uint32_t) __st200addcg(0, C_hi, S >> 32);
The assembly code generated by the ST200 compiler (-O3,
ST231) on the above piece of code then looks like
convib $b0 = $r0;;
addcg $r0, $b0 = $r17, $r20, $b0;;
addcg $r8, $b0 = $r0, $r21, $b0;;
where reading $r0 returns 0, and where $r8 is a 32-bit reg-
ister containing the value of P . Since the latency of addcg
is of 1 cycle, we finally get P in 8 cycles instead of 9. The
main drawback of this approach is the explicit use of an in-
trinsic function, that makes the C code ST231-dependent.
A second way of getting P in 8 cycles through portable C
code is as follows. The above addcg-based approach com-
putes Dhi and Dmi, that is, performs the addition Bhi + C
exactly, whereC = Y Ahi is a full, 64-bit product. Hence P
can be computed as follows, with * now denoting the lower
half of the exact product of two 64-bit unsigned integers:
1 A_hi = mul(U, U); A_lo = U * U;
2
3
4 B_hi = mul(Y, A_lo); C = (uint64_t) Y
5 * (uint64_t) A_hi;
6
7 S = B_hi + C; // S and C are of type uint64_t
8 P = S >> 32;
Interestingly, the assembly generated by the ST200 com-
piler for this portable C code contains exactly the sequence
of two addcg shown above, thus yielding P in 8 cycles.
This code is the one we have kept for our implementation.
3.4. Validation and performances
Our implementation, called rsqrt, has been compared
exhaustively to the power functions of the glibc and MPFR.
We have also compiled it with the ST200 VLIW com-
piler, in -O3 and for the ST231 core. Without subnormal
support, the latency of the generated assembly code is of 28
cycles. For comparison, the previously best available code
for the ST231 was an implementation of Goldschmidt’s
method with initial approximation by a degree-3 polyno-
mial [8, §12] and has a latency of 56 cycles. Our approach
is thus twice faster. Also, our code offers full subnormal
support at the cost of only 1 extra cycle, the latency then
being of 29 cycles.
Finally, the table below shows the advantage of using our
specialized operator rsqrt rather than simply compound-
ing division/inversion and square root. (Brackets contain
the results obtained when subnormals are not supported.)
Code sequence used Number N of Latency L N/L
for computing x−1/2 instructions (cycles)
div(1.0f,sqrt(x)) 164 [141] 57 [48] 2.9 [2.9]
inv(sqrt(x)) 123 [110] 48 [43] 2.6 [2.6]
rsqrt(x) 68 [63] 29 [28] 2.3 [2.2]
Although each of the operators div, inv, and sqrt
used here is highly optimized for the ST231, full special-
ization yields significantly smaller and faster codes. In fact,
such codes are also more accurate since only one rounding
error occurs instead of two.
Acknowledgements
This research was supported by “Poˆle de compe´titivite´
mondial” Minalogic and by the ANR project EVA-Flo.
Many thanks to Christophe Monat and Philippe The´veny for
useful discussions and comments on a draft of this paper.
References
[1] IEEE standard for floating-point arithmetic. IEEE Std. 754-
2008, pp.1-58, Aug. 29 2008.
[2] M. Daumas and G. Melquiond. Certification of bounds on
expressions involving rounded operators. ACM Transactions
on Mathematical Software, 37(1), 2009.
[3] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan
Kaufmann, 2004.
[4] L. Fousse, G. Hanrot, V. Lefe`vre, P. Pe´lissier, and P. Zim-
mermann. MPFR: A Multiple-Precision Binary Floating-
Point Library with Correct Rounding. ACM Transactions on
Mathematical Software, 33(2), 2007.
[5] C. Iordache and D. W. Matula. On infinitely precise round-
ing for division, square root, reciprocal and square root re-
ciprocal. In Proceedings of the 14th IEEE Symposium on
Computer Arithmetic, pages 233–240, 1999.
[6] C.-P. Jeannerod, H. Knochel, C. Monat, and G. Revy. Com-
puting floating-point square roots via bivariate polynomial
evaluation. Technical Report RR2008-38, LIP, Oct. 2008.
[7] J.-A. Pin˜eiro and J. D. Bruguera. High-speed double-
precision computation of reciprocal, division, square root
and inverse square root. IEEE TC, 51(12):1377–1388, 2002.
[8] S.-K. Raina. FLIP: a Floating-point Library for Integer Pro-
cessors. PhD thesis, E´NS Lyon, France, 2006.
[9] G. Revy. CGPE - Code Generation for Polynomial Evalua-
tion. http://cgpe.gforge.inria.fr/.
[10] M. J. Schulte and K. E. Wires. High-speed inverse square
roots. In Proceedings of the 14th IEEE Symposium on Com-
puter Arithmetic, pages 124–131, 1999.
[11] P. Zimmermann. Implementation of the reciprocal square
root in MPFR. In A. Cuyt, W. Kra¨mer, W. Luther,
and P. Markstein, editors, Numerical Validation in Current
Hardware Architectures, number 08021 in Dagstuhl Semi-
nar Proceedings, Dagstuhl, Germany, 2008.
