How to square floats accurately and efficiently on the ST231 integer processor by Jeannerod, Claude-Pierre et al.
How to square floats accurately and efficiently on the
ST231 integer processor
Claude-Pierre Jeannerod, Jingyan Jourdan-Lu, Christophe Monat, Guillaume
Revy
To cite this version:
Claude-Pierre Jeannerod, Jingyan Jourdan-Lu, Christophe Monat, Guillaume Revy. How to
square floats accurately and efficiently on the ST231 integer processor. 2010. <ensl-00532829>
HAL Id: ensl-00532829
https://hal-ens-lyon.archives-ouvertes.fr/ensl-00532829
Submitted on 19 Nov 2010
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destine´e au de´poˆt et a` la diffusion de documents
scientifiques de niveau recherche, publie´s ou non,
e´manant des e´tablissements d’enseignement et de
recherche franc¸ais ou e´trangers, des laboratoires
publics ou prive´s.
How to square floats accurately and efficiently on the ST231 integer processor
Claude-Pierre Jeannerod1, 3 Jingyan Jourdan-Lu2,3 Christophe Monat2 Guillaume Revy4
1INRIA, 2STMicroelectronics, 3Universite´ de Lyon, 4Universite´ de Perpignan
Laboratoire LIP (CNRS, ENS de Lyon, INRIA, UCBL), Universite´ de Lyon — 46, alle´e d’Italie, 69364 Lyon cedex 07, France
Compilation Expertise Center, STMicroelectronics — 12, rue Jules Horowitz BP217, 38019 Grenoble cedex, France
Laboratoire ELIAUS-DALI, Universite´ de Perpignan Via Domitia — 52 avenue Paul Alduy, 66860 Perpignan cedex 9, France
Abstract
We consider the problem of computing IEEE floating-
point squares by means of integer arithmetic. We show how
the specific properties of squaring can be exploited in or-
der to design and implement algorithms that have much
lower latency than those for general multiplication, while
still guaranteeing correct rounding. Our algorithm descrip-
tions are parameterized by the floating-point format, aim at
high instruction-level parallelism (ILP) exposure, and cover
all rounding modes. We show further that their C imple-
mentation for the binary32 format yields efficient codes for
targets like the ST231 VLIW integer processor from STMi-
croelectronics, with a latency at least 1.75x smaller than
that of general multiplication in the same context.
Keywords: squaring, binary floating-point arithmetic, cor-
rect rounding, IEEE 754, instruction level parallelism, C
software implementation, VLIW integer processor
1. Introduction
The STMicroelectronics ST231 is a 32-bit, 64-register,
4-issue, embedded integer VLIW processor. It is currently
used in consumer electronics devices (set-top-boxes, sound
processors, camera systems...) mainly performing intensive
media and signal processing. As this processor has integer-
only register file and ALUs, all the single precision floating-
point support is available through software emulation, based
on the highly optimized FLIP 1.0 library [9]. This compre-
hensive support is efficient enough to allow application de-
velopers to use out-of-the box floating-point code, instead
of converting it to fixed-point representations.
However, the implementation of floating-point opera-
tions on integer-only processors can lead to sub-optimal use
of the computational resources: for example, all binary op-
erators f(x, y) called with the same argument x = y lead
to redundant unpacking of the floating-point format, and to
useless checks for special cases, and fail to exploit some
properties of f(x, x). Since the compiler can detect all
static cases of such occurrences, this leads to the idea of
specializing operators to gain additional performance.
In this paper we focus on the specialization of the multi-
plication operator f : (x, y) 7→ x× y into a square operator
x 7→ x2. Squares of floating-point values are ubiquitous in
scientific computing and signal processing, since they are
intensively used in any algorithm requiring the computation
of Euclidean norms, powers, sample variances, etc. [6, 11].
We give a thorough study of how to design efficient soft-
ware for IEEE floating-point squaring on targets like the
ST231, that is, by means of integer arithmetic and logic
only, and with high ILP exposure.
Our first contribution is to show how the specific prop-
erties of squaring can be exploited in order to refine the
IEEE specification of multiplication, to deduce definitions
of special input and generic input that are suitable for im-
plementation, and to optimize the generic path and the spe-
cial path for latency. This analysis is done for all rounding
modes and presented in a parameterized fashion, in terms
of the precision and the exponent range of the input/output
floating-point format.
Second, this analysis allows us to produce a complete
portable C code for the binary32 format and each round-
ing mode. On the ST231 processor, the result is a latency
of 12 cycles for rounding ’to nearest even,’ which is 1.75x
faster than the 21 cycle latency of the multiply operator of
FLIP 1.0; for the other rounding modes, the speedups are
even higher and range from 1.9 to 2.3. Also, the average
number of instructions issued every cycle lies between 3.4
and 3.5, thus indicating heavy use of the 4 issues available.
Third, we report on some experiments involving non-
IEEE variants and square-intensive applications. We show
that relaxing the IEEE requirements (finite math only, no
support of subnormals) saves at most 1 cycle in the context
of the ST231. We also show that for applications like the
Euclidean norm, the sample variance, and binary powering
the practical impact of our fast squarer reflects well the best
that can theoretically be achieved.
For squaring in integer / fixed-point arithmetic several
optimized hardware designs have been proposed; see for
example [3, §4.9] as well as [14, 4] and the references
therein. However, for squaring in IEEE floating-point arith-
metic much less seems to be available and to the best of
our knowledge, no optimized design has been presented and
analyzed in the details as we do here, be it in hardware or
in software. Furthermore, the implementation of squaring
for the binary32 format and the ST231 processor outlined
in [12] does not support subnormal numbers, is available
only for rounding ’to nearest even,’ and has a latency of 27
cycles, which is 2.25x more than our 12 cycle latency.
Notation. For a real number r we write ⌊r⌋ for the
largest integer not greater than r, and ⌈r⌉ for the smallest
integer not less than r. If r satisfies further 0 ≤ r < 2 then
its binary expansion
∑
i≥0 ri2
−i with ri ∈ {0, 1} for all i
will be simply written (r0.r1r2 . . .)2. For a floating-point
datum x, we write |x| for the absolute value of x, assum-
ing implicitly that x is not NaN. Finally, we shall often use
the following square bracket notation: for any true-or-false
statement S, let [S] be 1 if S is true, 0 otherwise.
Outline. This paper is organized as follows. After some
reminders on the ST231 processor and IEEE binary floating
point in Section 2, we show in Section 3 how to specialize
to squaring the IEEE specification of multiplication, deduce
suitable definitions of generic and special input, and give a
high-level algorithmic view of the squaring operator. Sec-
tions 4 and 5 then detail our algorithms for handling, respec-
tively, generic and special input by means of integer arith-
metic, and give the corresponding C implementation for the
binary32 format. The performances of this implementation
on the ST231 processor are reported in Section 6, and we
conclude in Section 7. The detailed proofs of Properties 1
to 9 have been postponed to Appendix A for more legibility.
2. Background
2.1. STMicroelectronics’ ST231 processor
The targeted processor is the ST231 from STMicroelec-
tronics. Since it is a scoreboarded VLIW, there is no dy-
namic instruction dispatch contrary to what could achieve
an equivalent 4-way superscalar architecture: reaching the
best performance relies heavily on compiler efficiency.
Instruction scheduling is done after code selection,
that must be aggressive to favor high ILP. Specifically,
the if-conversion optimization that replaces conditional
branches by conditional ‘slct’ instructions helps creating
large straight-line code regions that are subject to efficient
scheduling. The latency of this ’slct’ instruction is 1 cycle.
Up to four independent instructions can be bundled into
a ‘syllable,’ achieving the maximum throughput of four in-
structions per cycle. One constraint is that the immediate
value that can be encoded as part of an instruction is limited
to 9-bit signed: a larger 32-bit constant, named extended
immediate, can be built by consuming an additional instruc-
tion syllable in the bundle. This reduces the actual paral-
lelism available locally, but this compares favorably with
other mechanisms such as loading the constant from mem-
ory, that incurs a performance impact on the data cache side
of the machine.
Only a very small subset of the ST231 instruction set
is needed by our squaring algorithms: logical and bitwise
operators (&&, ||, &, |), relational operators (<=, >=, >),
bitwise shift operators (<<, >>), addition, subtraction, max-
imum, and minimum of (un)signed integers (+, -, max,
maxu, minu), and finally a multiply operator mul giving
the higher half ⌊AB/232⌋ of the exact product of two 32-bit
unsigned integers A and B. Except for mul whose latency
is 3 cycles, the latency of each of these operators is 1 cycle.
2.2. IEEE 754-2008 binary floating point
Binary floating-point data. As specified in the IEEE
754-2008 standard [7], binary floating-point data consist
of quiet or signaling not-a-numbers (qNaN, sNaN), signed
zeros (+0, −0), signed infinities (+∞, −∞), and finite
nonzero binary floating-point numbers having the following
form: given a precision p and an exponent range [emin, emax],
x = (−1)s ·m · 2e, (1a)
where s is either 0 or 1, and where
m = (m0.m1 . . .mp−1)2 and emin ≤ e ≤ emax. (1b)
Any such number must in fact be either normal (m0 = 1) or
subnormal (m0 = 0 and e = emin). Thus, writing α for the
smallest positive subnormal number and Ω for the largest
normal number, we have
α = 2emin−p+1 and Ω = (2− 21−p) · 2emax .
Assumptions on emax, emin, and p. We shall assume that
emax = 2
w−1 − 1 for some positive integer w, (2a)
emin = 1− emax, 2 ≤ p < emax, (2b)
and we shall write
k = w + p.
All the binaryk formats of the IEEE 754-2008 standard sat-
isfy these assumptions; see [7, Table 3.5] and, for a proof
that 2 ≤ p < emax, see Appendix A. This allows us to carry
out all the analysis in a parameterized way and specialize
later to a given format, for example the binary32 format:
w = 8, emin = −126, emax = 127, p = 24.
Standard encoding into k-bit integers. For the binaryk
format the standard encoding of x in (1) is via a k-bit un-
signed integer X whose bit string satisfies
X = [s|Ew−1 · · ·E0|m1 · · ·mp−1], E =
w−1∑
i=0
Ei2
i, (3)
with E = e − emin + m0. Zeros, infinities, and NaNs
are encoded via special values of X . In particular, writing
|X| = X mod 2k−1, we have
x =


+0 iff X = 0,
+∞ iff X = 2k−1 − 2p−1,
sNaN iff 2k−1 − 2p−1 < |X| < 2k−1 − 2p−2,
qNaN iff |X| ≥ 2k−1 − 2p−2.
(4)
Correct rounding. Besides floating-point data and their
encoding into integers, the standard [7, §4.3] defines round-
ing modes ◦ to map any real number y to a unique floating-
point datum x = ◦(y). In radix 2 four rounding modes are
required: to nearest even (RN), down (RD), up (RU), and to
zero (RZ). In the case of the square operator we shall restrict
to RN, RD, and RU, since RZ(y) = RD(y) when y ≥ 0.
Exceptions and flags. Finally, the standard defines five
exceptional situations (invalid operation, division by zero,
overflow, underflow, inexact) and requires that they shall be
signaled by raising some status flags. For square, all excep-
tions but ’division by zero’ can occur, just like for multiply.
However, since the status flags are currently not set in the
multiply operator of the FLIP library [9] to which we com-
pare, we have not implemented them for square either.
3. Specification and high-level algorithm
Squaring being a special case of general multiplication
x × y, it is fully specified by the IEEE 754-2008 standard:
given a rounding mode ◦ and assuming x = y, the result r
prescribed by [7] for x× y is as follows:
r =


|x| if x ∈ {±0,±∞},
qNaN if x is NaN,
◦(x2) otherwise.
(5)
This specification shows that r is essentially known in ad-
vance when x is zero, infinity, or NaN. However, in the par-
ticular case of squaring, if |x| is nonzero but “small enough”
then the rounded value ◦(x2) will always be equal to a tiny
constant. Similarly, if |x| is finite but “large enough” then
◦(x2) will always be equal to a huge constant. The rest of
this section studies such properties of ◦(x2) in order to re-
fine the specification (5) and deduce a high-level algorithm.
Let min◦ and max◦ denote, respectively, the minimum
value and maximum value of ◦(x2) for |x| in [α,Ω]. Using
the monotonicity, for x ≥ 0, of the map x 7→ x2 together
with the definitions of rounding and overflow in [7], one
may check that these values are as follows:
◦ RN RD RU
min◦ +0 +0 α
max◦ +∞ Ω +∞
(6)
For which values of x are such extremal values attained? To
answer this, let us define the following two quantities
α′ = 2⌊(emin−p)/2⌋ and Ω′ = 2(emax+1)/2. (7)
We give below three properties regarding these quantities.
Property 1. The values α′ and Ω′ defined in (7) are normal
floating-point numbers such that α′ < Ω′.
Property 2. For ◦ ∈ {RN,RD,RU} and x a finite nonzero
floating-point number, one has ◦(x2) = max◦ iff |x| ≥ Ω′.
The interval [Ω′,Ω] thus defines the widest input range
on which max◦ is achieved. Interestingly, this range is the
same for all our rounding modes, which makes things sim-
pler from the implementer point of view. Note also that Ω′
is an integer power of two because of (2a). Formin◦ the sit-
uation is slightly more complex, as this minimum value is
achieved on an input range [α, α′◦] whose upper bound now
depends on ◦. However, the property below shows that one
can suppress this dependency by restricting to input ranges
whose upper bound has the form 2i for some integer i.
Property 3. The value α′ in (7) is the largest integer power
of two such that, for every finite nonzero floating-point num-
ber x in [α, α′), ◦(x2) = min◦ for ◦ ∈ {RN,RD,RU}.
The main outcome of Properties 2 and 3 is the following
specification of floating-point squaring, which refines (5):
r =


+0 if x = ±0,
min◦ if α ≤ |x| < α′,
◦(x2) if α′ ≤ |x| < Ω′,
max◦ if Ω
′ ≤ |x| ≤ Ω,
+∞ if x = ±∞,
qNaN if x is NaN.
(8)
This brings a natural distinction between two kinds of input:
Definition 1. Input x is called generic if α′ ≤ |x| < Ω′,
and special otherwise.
By Property 1 every subnormal input is special, so that
generic input consist of normal numbers only. The corre-
sponding output ◦(x2) must be finite because of the “only
if” part in Property 2, but it can be (sub)normal or zero.
Let us now define
Cspec = [x is special]. (9)
This condition allows us to translate (8) into the following
high-level algorithmic description, which shows that squar-
ing essentially reduces to three independent tasks Ti:
evaluate the condition Cspec in (9) [T1]
if (Cspec) then
handle special input as in (8) [T2]
else
compute ◦(x2) [T3]
Thanks to if-conversion, the generated assembly for the
above algorithm will consist of a straight-line piece of code
computing the result Ri of each task Ti and ending with a
’slct’ instruction that selects R2 if R1 is true, R3 otherwise.
For the design and implementation of each task we shall
proceed in two steps as in [1]: assuming unbounded paral-
lelism we optimize the a priori most expensive task first,
namely task T3 (see Section 4), and then only T1 and T2,
by trying to reuse as much as possible what was computed
for T3 (see Section 5). The latency of ’slct’ being of 1 cycle,
the lowest latency we can expect for squaring thus is 1 cycle
more than that of T3.
4. Computing correctly-rounded squares
In this section we consider the computation of ◦(x2) for
x generic, that is, x as in (1) and such that
α′ ≤ |x| < Ω′. (10)
By Property 1 such an x is normal, and thus 1 ≤ m < 2.
4.1. Parameterized formula for ◦(x2)
Normalized representation of the exact square. A first
step towards the computation of ◦(x2) consists in normal-
izing the representationm2 · 22e of x2 implied by (1a). Let
c =
[
m ≥
√
2
]
. (11)
Defining m′ = m2 · 2−c and e′ = c + 2e then yields the
unique pair (m′, e′) ∈ R× Z such that 1 ≤ m′ < 2 and
x2 = m′ · 2e′ . (12)
This is the so-called normalized representation of the exact
square. Tight bounds for e′ are given by the next result.
Property 4. The normalized exponent e′ of x2 satisfies
2⌊(emin − p)/2⌋ ≤ e′ ≤ emax.
These bounds for e′ are the best possible ones:
• The lower bound is attained when |x| = α′. It is equal
to emin−p−ǫwith ǫ =
[
p is odd
]
. One has ǫ = 0 for the
standard binary32 format, and ǫ = 1 for the standard
binary16, binary64, and binary128 formats [7, §3.6].
• The upper bound emax is attained for example when
x = (1.1)2 ·2(emax−1)/2, which satisfies (10) and whose
square is (1.001)2 · 2emax .
Also, the tight lower bound on e′ is less than emin for p ≥ 2,
so that both situations e′ ≥ emin and e′ < emin do occur.
Correctly-rounded value of the exact square. When
e′ ≥ emin the normalized representation (12) already allows
to express the correctly-rounded value ◦(x2). In this case
x2 lies in the range [2emin , 2emax+1) and, sincem′ ∈ [1, 2),
◦(x2) = ◦(m′) · 2e′
= ◦(m2 · 2−c) · 2c+2e. (13a)
When e′ < emin the exact square ranges in (0, 2
emin). In
this case, we first set the exponent to emin and only then
round the resulting scaled significand in fixed point:
◦(x2) = ◦˜(m′ · 2−(emin−e′)) · 2emin
= ◦˜(m2 · 2−(emin−2e)) · 2emin , (13b)
where ◦˜ denotes the function that rounds the reals from
[0, 2) in the same direction as ◦ but on the regular grid
{i · 21−p : i = 0, 1, . . . , 2p}.
In order to handle the cases (13a) and (13b) simultane-
ously, let us define
µ = max(c, emin − 2e). (14)
Since ◦˜ coincides with ◦ on [1, 2), the correctly-rounded
value of the exact square is given in both cases by
◦(x2) = ◦˜(ℓ) · 2d, (15a)
where
ℓ = m2 · 2−µ and d = µ+ 2e. (15b)
By construction one has 0 < ℓ < 2 and emin ≤ d ≤ emax. The
property below further gives bounds for the range of µ.
Property 5. One has c ≤ µ ≤ p+ ǫ with ǫ = [p is odd].
Again, these bounds are tight: x = 1 gives µ = 0, while
x = ±α′ gives c = 0, e′ = emin− p− ǫ, and then µ = p+ ǫ.
An explicit formula for ◦˜(ℓ). The above analysis has re-
duced, for precision p, floating-point rounding of the exact
result x2 to fixed-point rounding of the scaled significand ℓ
in (15). Using classical rounding formulae (see for exam-
ple [3, §8.4.3]), we now give an explicit expression for ◦˜(ℓ).
Writing ℓ = (ℓ0.ℓ1 . . . ℓp−1ℓp . . .)2 and using ∨ for logical
OR, the guard bit and sticky bit of ℓ are, respectively,
g = ℓp and t = ℓp+1 ∨ ℓp+2 ∨ · · · . (16)
Its correctly-rounded value is then given by the formula
◦˜(ℓ) = (ℓ0.ℓ1 . . . ℓp−1)2 + b · 21−p, (17)
where the round bit b satisfies (see [3, pp. 436-437])
b =


g ∧ (ℓp−1 ∨ t) if ◦ = RN,
0 if ◦ = RD,
g ∨ t if ◦ = RU.
(18)
4.2. Implementation for the binaryk format
We detail here how to implement, for x generic and
the binaryk floating-point format, the computation of r =
◦(x2) using k-bit integer arithmetic and logic. We assume
x is given by its standard encoding into an unsigned k-bit
integerX . Since the result r satisfies (15a), it is known (see
for example [13, §2.3.1]) that its standard integer encoding
R is given by R = D · 2p−1 + L˜, where
D = d+ emax − 1 (19)
and L˜ = ◦˜(ℓ) · 2p−1. Using (17) we obtain L˜ = L+ b with
L =
⌊
ℓ · 2p−1⌋ (20)
and b the sticky bit defined in (18), so that eventually
R = D · 2p−1 + L+ b. (21)
Consequently, computing R amounts to deducing D, L,
b from X , which we detail now in a parameterized fashion,
that is, for the binaryk format. We illustrate our analysis
for the binary32 format, and the corresponding C code (for
rounding ’to nearest even’) appears in Listing 1. In order
to give an idea of the ILP exposed by our approach, this C
code has been set out in such a way that line i displays only
the expressions that can be evaluated in i cycles with the
ST231 latencies and assuming unbounded parallelism.
Computing L. From (15b) and (20) it follows that
L =
⌊
m2/21−p+µ
⌋
(22)
and, therefore, we first need to deduce from X an integer
encoding ofm, sayM , as well as the integer µ.
To produceM , recall thatm = (1.m1 . . .mp−1)2 as x is
normal. Since p ≤ k a possible choice is to set upm · 2k−1,
which can be obtained from (3) by shifting and masking:
M = m · 2k−1 = (X ≪ w) | 2k−1.
For the binary32 format, where w = 8 and k = 32,
this corresponds to line 2 in Listing 1; on ST231, this
takes 2 cycles and, due to the extended immediate value
231 = (80000000)16, 3 instruction syllables.
To get µ, recall first that x normal implies e = E − emax.
Then, recalling that emin = 1− emax and applying (14),
µ = max(c, F ), F = emax + 1− 2E. (23)
To get the boolean value c, it suffices to remark that (11)
andM = m · 2k−1 imply
c = [M > M0], with M0 =
⌊√
2 · 2k−1
⌋
.
To get the (possibly negative) integer F , first we extract 2E
from X in (3) by using the identity
2E =
(
X ≫ (p− 2))&(2w+1 − 2), (24)
and then we subtract 2E from the constant emax + 1. For the
binary32 format the computation of c and F appears at line
3 of Listing 1, where (b504f333)16 is M0 for k = 32; on
ST231 each of c and F takes 3 cycles, so that µ is eventually
obtained in 4 cycles.
Let us now see how to deduce L from M and µ. The
property below shows that the k most significant bits of the
2k-bit integerM2 are enough for that purpose.
Property 6. L = ⌊H/2µ+w−1⌋ with H = ⌊M2/2k⌋.
Given M the mul instruction computes the higher half H
ofM2, which then, by Property 6, simply has to be shifted
right by µ + w − 1 in order to yield L. For the binary32
format, this appears at lines 5 and 6 of Listing 1. Since here
p = 24 is even, we deduce from Property 5 that µ + 7 is
at most 31, which thus agrees with the C99 specification of
the bitwise shift operator [8, p. 84]. With the ST231 latency
constraints, both H and µ + 7 are computed from X in 5
cycles, so that L is obtained in 6 cycles.
Computing b. We focus here on the most interesting
case, rounding to nearest even, for which b = g∧ (ℓp−1∨ t)
with g and t as in (16).
Note first that g is the least significant bit of the integer
G = ⌊ℓ · 2p⌋ = ∑0≤i≤p ℓi2p−i and, using a proof similar
to that of Property 6, we arrive at
g = G mod 2, G = ⌊H/2µ+w−2⌋.
For the binary32 format, the corresponding C code appears
at lines 6 and 9 of Listing 1. On ST231, G will have the
same latency as L (6 cycles), and we thus get g in 7 cycles.
Since ℓp−1 is the least significant bit of L we have
ℓp−1 = L mod 2,
so that we are left with computing the sticky bit t. The next
result shows how to recover this bit simply by checking that
some lower parts of H and X are nonzero, that is, without
computing the lower half of the exact squareM2.
Listing 1. Computing ◦(x2) for the binary32 format, ◦ = RN, and x generic.
1 T2 = X & 0xff;
2 M = (X << 8) | 0x80000000; E2 = (X >> 22) & 0x1fe;
3 F = 128 - E2; c = M > 0xb504f333;
4 mu = max(c, F);
5 H = mul(M, M);
6 L = H >> (mu + 7); G = H >> (mu + 6); T1 = H << (26 - mu);
7
8
9 b = (G & 1) && ((L & 1) | (T1 | T2));
10 return (((mu - F) << 23) + L) + b;
Property 7. One has t = [T1 6= 0] ∨ [T2 6= 0] with T1 and
T2 the two k-bit integers given by
T1 = H ≪ (p+ 2− µ) and T2 = X mod 2p−⌊k/2⌋.
For the binary32 format, Property 7 gives T2 = X mod 2
8,
which can be implemented by masking X as shown at line
1 of Listing 1. The computation of T1 is a mere left shift
of H by 26 − µ, the latter value ranging in [0, 31] thanks
to Property 5. Then notice that the bit ℓp−1 ∨ t is zero if
and only if the integer U obtained by bitwise-ORing the
integers L & 1 and T1 and T2 is zero. The logical AND
of g = G & 1 ∈ {0, 1} and U is thus enough to yield b,
which allows us to avoid testing explicitly if T1 or T2 is
nonzero. This is shown at line 9 of Listing 1. The paren-
thesization chosen there aims to reduce the overall latency
for b on ST231: both L and T1 can be obtained in 6 cycles,
while T2 costs 1 cycle; therefore, both L & 1 and T1 |T2
follow in 7 cycles, which yields b in 9 cycles.
ComputingD. From the definitions of d andD in (15b)
and (19) we deduce that D = µ + 2e + emax − 1. Hence,
recalling that e = E − emax and the definition of F in (23),
D = µ− F.
For the binary32 format, this subtraction appears at line 10
of Listing 1. Recalling that on ST231 we get F and µ in,
respectively, 3 and 4 cycles, we will thus get D in 5 cycles.
Packing the result. From (21) the integer encoding R
of the result satisfies R = D · 2p−1 + L + b and we have
just detailed how to get from X the integers D, L, and b.
Moreover, assuming the latency model of the ST231, their
respective cost has been shown to be of 5, 6, and 9 cycles.
This implies a latency of 6 cycles forD ·2p−1, and using the
parenthesization shown at the last line of Listing 1 we even-
tually get R in 10 cycles. Thus, when ◦ is RN the overall
cost is larger than that of the round bit b by only one cycle.
Some simplifications when ◦ is not RN. When the
rounding mode ◦ is RD the round bit b in (18) is zero. Con-
sequently, the instructions involvingG, T1, T2, and b can be
suppressed and the last line of Listing 1 replaced with:
return ((mu - F) << 23) + L;
When ◦ is RU the bit ℓp−1 is not needed and b is the
logical OR of g = G & 1 and T1 |T2. In this case, we thus
replace line 9 of Listing 1 by:
b = (G & 1) | (T1 | T2);
With these new codes the expected latency of R on
ST231 drops from 10 to 7 cycles for ◦ = RD, and from
10 to 9 cycles for ◦ = RU.
5. Detecting and handling special input
We first have to decide whether input x is special or not,
that is, to compute from X the value of Cspec in (9). By
Definition 1 this condition satisfies
Cspec = Csmall ∨ Clarge ∨ Cnan (25)
with Csmall =
[|x| < α′], Clarge = [|x| ≥ Ω′], and Cnan =[
x is NaN
]
. The next two properties show how to obtain
Csmall and Clarge ∨ Cnan by reusing the value 2E computed
for the generic case (see (24) and line 2 of Listing 1).
Property 8. Csmall = [2E ≤ emax − p− 1].
Property 9. Clarge ∨ Cnan = [2E ≥ 3emax + 1].
For the binary32 format, a C fragment implementing Cspec
by means of 2E and the two previous properties is shown
at lines 2 to 4 of Listing 2. On ST231 the cost will be of 4
cycles and, as Cspec is independent of ◦, this fragment holds
not only for ◦ = RN but also for ◦ ∈ {RD,RU}.
Listing 2. Detecting and handling special in-
put for the binary32 format and ◦ = RN.
1absX = X & 0x7fffffff;
2E2 = (X >> 22) & 0x1fe; Cnan = absX > 0x7f800000;
3Csmall = E2 <= 102; Clarge_or_nan = E2 >= 382;
4Cspec = Csmall || Clarge_or_nan;
5if (Cspec) {
6if (Csmall) return 0; // r = +0
7else {
8if (Cnan) return 0x7fc00000; // r = qNaN
9else return 0x7f800000; } // r = +oo
10} else {
11// generic case (Listing 1).
12}
Once special input have been filtered out, it remains to
return, for the given rounding mode ◦, the standard integer
encoding R of the associated result r prescribed by (8):
When ◦ is RN. We deduce from (6) and (8) that for x
special, r must be +0 if |x| < α′, qNaN if x is NaN, and
+∞ otherwise. Implementing this is then straightforward
as it suffices to recall from (4) that, on the one hand 0,
2k−1 − 2p−1, and 2k−1 − 2p−2 are standard encodings of
+0, +∞, and qNaNs, and that, on the other hand,
Cnan =
[
X &(2k−1 − 1) > 2k−1 − 2p−1].
For the binary32 format, the computation of Cnan is shown
at lines 1 and 2 of Listing 2, while lines 6 to 9 display
the computation of R. On ST231, lines 6 to 9 will be if-
converted as shown by the following pseudo-code:
Rlarge_or_nan = slct(Cnan,0x7fc00000,0x7f800000)
R = slct(Csmall,0,Rlarge_or_nan)
With a latency of 1 cycle for the ’slct’ instruction and since
Cnan costs 2 cycles, we thus get R for x special in 4 cycles.
When ◦ is not RN. For ◦ = RD the only difference with
the previous case is when |x| ≥ Ω′ and, by(6) and (8), we
now have r = max(|x|,Ω). The standard integer encoding
of Ω is 2k−1 − 2p−1 − 1, which equals (7f7fffff)16 for
the binary32 format. Consequently, it suffices to replace
line 9 of Listing 2 with:
else return maxu(absX, 0x7f7fffff);
For ◦ = RU, the specification differs from that for ◦ =
RN only in the case where |x| < α′, for which we have
r = min(|x|, α). Since the standard integer encoding of α
is 1, an implementation for the binary32 format follows by
simply replacing line 6 in Listing 2 by
if (Csmall) return minu(absX, 1);
On ST231, R still costs 4 cycles as both max(|x|,Ω) and
min(|x|, α) have a latency of 2 cycles, like Cnan.
6. Experimental results obtained on ST231
The C codes detailed in Sections 4 and 5 yield a full im-
plementation of squaring, for the binary32 format and each
rounding mode. To check correctness we compiled them
with gcc (using a C emulation of the mul, max, maxu, and
minu operators as given for example in [13, Appendix B]),
and compared with the results of multiplication x × x ob-
tained on an Intel R© Xeon R© workstation. For each rounding
mode this exhaustive test took about five minutes.
We also compiled our C codes with the ST200 compiler,
in -O3 for the ST231 processor. The remainder of this sec-
tion details the performances obtained in this context.
6.1. Operator performances
Latency and comparison with general multiplication.
The latency on ST231 of our binary32 square implemen-
tation is shown in the second column of Table 1. Due to
if-conversion, it gives for each rounding mode a number of
clock cycles independent of the input value x. The values
within square brackets indicate the lowest latencies we can
theoretically achieve with the ST231 latency constraints and
assuming unbounded parallelism; these best latencies fol-
low from our analysis in Sections 4 and 5 and have the form
1+L with L the best latency for the generic case. This first
experiment shows that the latencies achieved in practice are
at most 1 cycle form the best possible ones.
For comparison, the third column of Table 1 displays
the latencies of the multiply operator x× y available in the
FLIP 1.0 library and optimized for the ST231 [9]. As shown
in the fourth column, our specialization of this multiply op-
erator into a square operator yields a speedup between 1.75
and 2.3, depending on the rounding mode.
◦ square FLIP 1.0 multiply speedup
RN 12 [11] 21 1.75
RD 9 [8] 21 2.3
RU 11 [10] 21 1.9
RZ 9 [8] 18 2
Table 1. Latency comparison for square and multiply.
Instruction-level parallelism. When designing our al-
gorithms in Sections 4 and 5 we strived to expose as much
ILP as we could. To evaluate ILP in practice we use
instructions-per-cycle (IPC), which is the parallelism really
exposed on the target. As shown in Table 2 it is deduced
from the assembly code by dividing the number of instruc-
tions by the latency. The IPC achieved is close to the highest
ILP reachable within the architectural constraints of the ma-
chine, demonstrating a very efficient usage of its resources.
◦ Latency L Number N of instructions IPC = N/L
RN 12 42 3.5
RD 9 31 3.4
RU 11 37 3.4
Table 2. Latency, code size, and IPC for square.
Comparison with two non-IEEE variants. To study
the impact on latency of relaxing the IEEE 754 specification
used so far, we have implemented for each rounding mode a
finite-math-only variant and a variant without subnormals.
Finite math only. We assume here that input and output
are neither infinity nor NaN, and that overflow does not oc-
cur. Hence x now satisfies |x| < Ω′. On the one hand, this
leaves the generic path unchanged, so that the best possible
latencies are the same as for our IEEE version. On the other
hand, (25) becomes Cspec = Csmall and the C codes of Sec-
tion 5 can be simplified accordingly. As the third column of
Table 3 shows, in practice this simplification has no impact
on latency for RD, while it saves 1 cycle for RN and RU.
No subnormals. Here we assume that x is not subnormal,
which means, writing λ = 2emin for the smallest positive
normal number, that x is either NaN, zero, or such that λ ≤
|x|. We also assume that if the exact result x2 lies in the
subnormal range (0, λ) then r = +0 for RN and RD, and
r = λ for RU. Since x2 < λ is equivalent to |x| < λ′ with
λ′ =
√
λ, the specification of this non-IEEE variant can
thus be deduced from (6) and (8) simply by replacing α and
α′ with, respectively, λ and λ′.
For the special path, this relaxed specification implies
Csmall = [|x| < λ′] and the proof of Property 8 can be
adapted to show that we now have Csmall = [2E ≤ emax−1].
Thus, it suffices to replace 102 by 126 at line 3 of Listing 2
and, for RU, to replace 1 by 223 in the minu operation.
These updates clearly have no impact on the latency.
Concerning the generic path, the case (13b) need not be
considered anymore, so that µ in (14) is now equal to c.
Hence the max operation at line 4 of Listing 1 can be re-
moved and we can replace µ by c at line 6. However, as H
still has a latency of 5 cycles, this simplification does not
shorten the critical path. This means that the best latencies
that we can theoretically achieve with this second non-IEEE
variant are the same as for our IEEE version. Furthermore,
Table 3 shows that this is true in practice as well.
◦ IEEE finite math only no subnormals
RN 12 11 12
RD 9 9 9
RU 11 10 11
Table 3. Latency comparison with two non-IEEE variants.
6.2. Application examples
We now apply our fast operator to some square-intensive
algorithms in order to study its effect on real applications.
After giving a theoretical model of the speedup achievable
on ST231, we show practical speedups and how they match
that model. All experiments have been done on the ST231
cycle-accurate simulator, and while we focus on rounding
’to nearest even’ (RN) similar results hold for RU and RD.
Speedup model for loop nests involving squares.
While Table 1 gives speedups for a single replacement of
multiply by square, we now evaluate the theoretical expec-
tation of speedup for loops. When the square operator ap-
pears in a loop of n iterations, we introduce the definition
theoretical speedup =
(L+∆ · σ) · n+ C
L · n+ C
with L, ∆, σ, and C given as follows:
• L is the latency of the loop body where squares are
used, which includes the application-related operations
inside the loop and the cost to maintain the loop.
• ∆ is the latency gap between multiply and square.
• σ is the number of squares in a single iteration.
• C is the cost of the straight-line code outside the loop.
Note that when n tends to infinity, the theoretical speedup
tends to 1 + ∆ · σ/L, which does not depend on C.
On ST231 the values ∆, L, and C have the following
features. First, Table 1 gives ∆ = 21 − 12 = 9 cycles.
Second, L and C can be modelled as
L = a · Br+ Idx+ Ld+ Cin, C = St+ Cout.
Here, Br is the cost of a taken branch, which is 3 cycles.
The unrolling transformation done by the compiler has the
effect of dividing the number of branches taken to jump
back to the head of the loop by the unrolling factor; a in
(0, 1] is used to denote this effect. The value of a depends
on the application and on the compiler optimization level;
to estimate the trend of the speedup, we take a = 0.5, mean-
ing that the loop is always unrolled by a factor of 2. Idx is
the cost to test loop index, which is 1 cycle; Ld is the cost
to load input values, which is 3 cycles. Cin (resp. Cout) is
the latency of the application-related code within (resp. out-
side) the loop; for each application, the values of Cin and
Cout can be deduced from the latencies of the 5 basic op-
erators of FLIP 1.0 [13, Table 1] and from the latency of
12 cycles of our square operator. Finally, St is the cost of
stack handling of the function call; it ranges from 10 to 12
cycles depending on the achievable ILP of each application.
The above model has been applied to three application
examples, which we review now.
Example 1: Euclidean norm. Given a vector v of n
floating-point data vi we consider the computation of its
Euclidean norm ‖v‖2 = (
∑n
i=1 v
2
i )
1/2 by means of three
different algorithms.
Naive algorithm. Here ‖v‖2 is produced by a for loop
whose ith iteration simply squares vi and adds it to the cur-
rent partial sum of squares. Figure 1 shows that the theoret-
ical speedup tends to about 1.185 and that in practice we are
close to this value as soon as n ≥ 20, assuming n is known
at runtime. If n is known at compile time then an even
higher speedup is observed, since the compiler achieves
more efficient loop unrolling optimization. The nine black
bullets in Figure 1 illustrate this fact for n = 2, . . . , 10. The
greatest speedups are for n ≤ 4, since in this case all the
Figure 1. Impact of square on naive Euclidean norm.
parameters are directly passed to the function by registers
instead of by stack and the loop is fully unrolled.
Two-pass algorithm. This algorithm, which is already
mentioned in [2], aims to avoid overflow by first computing
‖v‖∞ = maxi |vi| and then applying the naive algorithm
to the scaled vector [vi/‖v‖∞]i. The input is scanned twice
and n divisions are used. Thus, the speedup we can expect
is lower than for the naive algorithm, as shown in Figure 2.
However, we see that the practical speedup still matches
well the theoretical model as soon as n ≥ 20.
Figure 2. Impact of square on two-pass Euclidean norm.
One-pass algorithm. This algorithm also intends to
avoid overflow but requires only one pass over the data, the
scaling factor being now computed on the fly. It is an adap-
tation of Blue’s algorithm [2] attributed to Hammarling and
implemented in LAPACK [6, p. 507]. Each of the n itera-
tions performs exactly 1 square as well as 2 or 4 additional
operations, depending on the data vi. The total number of
operations thus varies dynamically and turns out to be max-
imum when |v1| < |v2| < · · · < |vn|, and minimum when
|v1| ≥ |v2| ≥ · · · ≥ |vn|. Each of these two extreme cases
yields a behavior similar to the one displayed in Figure 2,
the limiting value when n → ∞ being about 1.077 in the
first case, and about 1.096 in the second case.
Example 2: sample variance. The sample variance of
v1, . . . , vn is
1
n−1
∑n
i=1(vi − v¯)2, where v¯ = 1n
∑n
i=1 vi.
It is known [6, p. 11] that it can be evaluated accurately
Figure 3. Impact of square on binary powering.
by the naive way, which requires two passes over the data:
get v¯ first and then iteratively square and add the vi − v¯’s.
This method has the same structure as the two-pass algo-
rithm in the previous example (with the maximum replaced
by a sum, and the division replaced by a subtraction). Con-
sequently, the theoretical and practical speedups brought by
our fast squarer are similar to those in Figure 2.
Example 3: binary powering. For a floating-point da-
tum x and for n an integer power of two such that n ≥ 4,
we consider here the evaluation of xn by means of log2 n
successive squares. The results obtained for this applica-
tion are shown in Figure 3. The qualitative analysis done
for the naive algorithm of Example 1 still applies but, since
binary powering does not involve any operation other than
squaring, higher speedups are observed.
7. Conclusion and perspectives
This work has presented optimized C codes for IEEE bi-
nary32 squaring which are, on ST231, significantly faster
than the corresponding implementation of general multipli-
cation, thus demonstrating the practical benefit of operator
specialization in this context.
To achieve this, we strived to exploit the specific features
of the square operator: it is univariate, always nonnegative,
and has predictable overflow ranges; subnormal input can
always be handled by the special path; and, in the generic
path, the lower half of the exact square of the input signif-
icand is not needed, and this even for computing the sticky
bit used to provide correct rounding. We also heavily relied
on some features of our target processor and its compiler,
like VLIW 4-issue parallelism, if-conversion and fast ’slct’
instruction, fast min and max instructions, and the availabil-
ity of extended immediates and of an integer multiplier of
type ’32 × 32→ 32 high.’ However, as our C codes show,
the subset of the ST231 instruction set needed by our fast
squarer is extremely modest and, unlike for multiplication,
the leading-zero count instruction is useless in this context.
Furthermore, all our algorithmic descriptions are param-
eterized by the binaryk format, thus offering increased con-
fidence and flexibility: the correctness analysis is done once
for all, and one can deduce portable C codes for various
formats in a direct way as soon as the corresponding k-bit
integer arithmetic and logic are available. In fact, we al-
ready have such a code for the binary64 format and it relies
on a 64-bit integer layer highly optimized for ST231. This
code is certainly suboptimal, though, as some parts of the
algorithm for binary64 squaring could directly use 32-bit
integers. Therefore, it would be interesting to implement an
optimized 32-bit version of binary64 squaring and to evalu-
ate the latency improvement compared to the 64-bit version
produced with our approach.
Three other extensions would also be worth considering.
First, although our approach assumes radix 2 we should
investigate to which extent it carries over other radices,
and especially radix 10. Second, the overhead (in terms
of latency and code size) of setting status flags remains
to be studied. Third, squaring is not only a specialization
of multiplication but also of integer powering, that is, of
pown(x,n) = xn with n an integer. Since the 2008 revision
of the IEEE 754 standard recommends correct rounding for
this operator [7, Table 9.1], its optimized implementation
for VLIW integer processors will be a natural extension of
the work we have presented here.
References
[1] C. Bertin, C.-P. Jeannerod, J. Jourdan-Lu, H. Knochel,
C. Monat, C. Mouilleron, J.-M. Muller, and G. Revy. Tech-
niques and tools for implementing IEEE 754 floating-point
arithmetic on VLIW integer processors. In Proceedings of
PASCO’10, pages 1–9, New York, NY, USA, 2010. ACM.
[2] J. L. Blue. A portable Fortran program to find the Euclidean
norm of a vector. ACM Trans. Math. Software, 4(1):15–23,
1978.
[3] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan
Kaufmann, 2004.
[4] M. Go¨k. Integer squarers with overflow detection. Comput-
ers & Electrical Engineering, 34(5):378–391, 2008.
[5] R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete
Mathematics: A Foundation for Computer Science. Addi-
son-Wesley, Reading, MA, USA, second edition, 1994.
[6] N. J. Higham. Accuracy and Stability of Numerical Algo-
rithms. Society for Industrial and Applied Mathematics,
Philadelphia, PA, USA, second edition, 2002.
[7] IEEE Computer Society. IEEE Standard for Floating-Point
Arithmetic. IEEE Standard 754-2008, Aug. 2008.
[8] International Organization for Standardization. Program-
ming Languages – C. ISO/IEC Standard 9899:1999,
Geneva, Switzerland, Dec. 1999.
[9] C.-P. Jeannerod and G. Revy. FLIP 1.0: a fast floating-point
library for integer processors, February 2009. Available at
http://flip.gforge.inria.fr/.
[10] J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod,
V. Lefe`vre, G. Melquiond, N. Revol, D. Stehle´, and S. Tor-
res. Handbook of Floating-Point Arithmetic. Birkha¨user
Boston, 2010.
[11] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Nu-
merical Recipes in C: The Art of Scientific Computing. Cam-
bridge University Press, Oct. 1992.
[12] S.-K. Raina. FLIP: a Floating-point Library for Integer Pro-
cessors. PhD thesis, E´NS Lyon, France, 2006.
[13] G. Revy. Implementation of binary floating-point arithmetic
on integer processors: polynomial evaluation-based algo-
rithms and certified code generation. PhD thesis, Universite´
de Lyon - E´NS de Lyon, France, Dec. 2009.
[14] E. G. Walters, J. Schlessman, and M. J. Schulte. Combined
unsigned and two’s complement hybrid squarers. In Pro-
ceedings of the thirty-fifth Conference on Signals, Systems,
and Computers (Asilomar 2001), volume 1, pages 861–866,
Asilomar, Pacific Grove, CA , USA, 2001. IEEE.
A. Proofs
Proof that 2 ≤ p < emax for all standard binary formats.
The standard binary formats are defined in [7, Table 3.5].
There we see that, on one hand, this is true when k ≤ 128.
On the other hand, when k > 128 we have
p = k − jk + 13 and emax = 2jk−14 − 1,
with jk = round(4 log2 k), the integer closest to 4 log2 k.
(If k is an integer then 4 log2 k cannot be exactly halfway
two consecutive integers, so that no tie can occur.) By def-
inition of round, 4 log2 k − 1/2 < jk < 4 log2 k + 1/2.
This implies first that, for k > 27, 2 ≤ p ≤ k. Sec-
ond, 2−14.5k4 − 1 < emax. Third, k3 > 215.5 and then
2−14.5k4 > 2k > k + 1, so that k < emax.
Proof of Property 1. Since α′ and Ω′ in (7) are integer
powers of two it suffices to verify that
emin ≤ ⌊(emin − p)/2⌋ < (emax + 1)/2 ≤ emax.
The leftmost inequality is equivalent to emin ≤ (emin − p)/2
because emin is an integer; since emin = 1 − emax, the lat-
ter inequality is itself equivalent to p < emax, which is true
by (2b). From (2b) it also follows in particular that emax ≥ 1,
which implies the rightmost inequality. The remaining in-
equality follows from the fact that (2b) implies that emin is
negative while p and emax are positive.
Proof of Property 2. If |x| ≥ Ω′ then x2 ≥ 2emax+1, so that
◦(x2) = max◦ for each ◦. Conversely, assume that |x| <
Ω′. Since x is a finite floating-point number in precision
p, we deduce that |x| ≤ (2 − 21−p) · 2(emax−1)/2 and then
x2 ≤ C ·2emax , with C = 2(1−2−p)2. Since C < 2−21−p
one has further x2 < Ω. This implies ◦(x2) < max◦ for
each ◦ and the conclusion follows.
Proof of Property 3. If α ≤ x < α′ then 0 < x2 < α/2,
so that RN(x2) = RD(x2) = +0 and RU(x2) = α.
This shows that x ∈ [α, α′) implies ◦(x2) = min◦ for
all ◦. To prove the maximality of α′ it suffices to check
that RN(y2) 6= minRN for some floating-point number y in
[α′, 2α′), say y = 32α
′. We have y2 = 942
2⌊(emin−p)/2⌋ and,
using the fact that ⌊i/2⌋ ≥ (i − 1)/2 when i ∈ Z, we de-
duce that y2 ≥ 98α/2 > α/2. It follows that RN(y2) = α,
which differs from minRN = +0.
Proof of Property 4. Recalling (7) and taking squares
in (10), we obtain 22⌊(emin−p)/2⌋ ≤ m′ · 2e′ < 2emax+1. On
the one hand,m′ < 2 implies 2⌊(emin − p)/2⌋ < e′+1 and,
since both sides are integers, the announced lower bound
follows. On the other hand, 1 ≤ m′ implies e′ < emax + 1
and, similarly, we deduce the announced upper bound.
Proof of Property 5. The lower bound is an immediate
consequence of the definition of µ in (14). To establish the
claimed upper bound we consider two cases:
• If µ = c then µ is at most 1 and cannot be larger than
p+ ǫ, since p ≥ 2 and ǫ ≥ 0.
• If µ = emin − 2e then, recalling that e′ = c + 2e
and noticing that the lower bound in Property 4 equals
emin − p + ǫ (because emin is even), we obtain µ ≤
p+ ǫ+ c. Since both µ and p+ ǫ are even integers, and
since c is either 0 or 1, it follows that µ ≤ p+ ǫ.
Hence µ ≤ p+ ǫ in both cases, and the proof follows.
Proof of Property 6. From (22), M = m · 2k−1, and k =
w + p it follows that L = ⌊y/n⌋ with y = M2/2k and
n = 2µ+w−1. Since w ≥ 3 and since, by Property 5, µ ≥ 0,
n is a positive integer. To conclude it suffices to apply the
fact that ⌊y/n⌋ = ⌊⌊y⌋/n⌋ for n > 0 (see [5, p. 72]).
Proof of Property 7. Let q be the number of trailing zeros
ofm = (1.m1 . . .mp−1)2. Thenm
2 can be written
m2 = (s−1s0.s1 . . . s2p−2q−2)2 with s2p−2q−2 = 1.
Thus,H = (s−1s0 . . . sk−2)2 and, using (15b) and (16), we
can also decompose the sticky bit as t = t1 ∨ t2 with
t1 = sp−µ+1∨· · ·∨sk−2 and t2 = sk−1∨· · ·∨s2p−2q−2.
By Property 5 and since w ≥ 3, we have k − 2− (p− µ+
1)+ 1 = µ+w− 2 ∈ {1, . . . , k− 1}. Hence t1 = 1 if and
only if the last µ+ w − 2 bits of H are not all zero, that is,
if and only if the integer T1 obtained by shifting H left by
k− (µ+w− 2) = p+2−µ is nonzero. Since s2p−2q−2 =
1 we have t2 = 0 if and only if k − 1 > 2p − 2q − 2,
that is, if and only if the number q of trailing zeros of m
is at least p − ⌊k/2⌋. The latter condition is equivalent to
X mod 2p−⌊k/2⌋ = 0, which concludes the proof.
Proof of Property 8. Let |X| and A′ denote the stan-
dard integer encodings of |x| and α′, respectively. It is
known [10, p. 58] that |x| < α′ if and only if |X| < A′,
that is,
|X| ≤ A′ − 1. (26)
By (3), |X| = (E+ ǫ) · 2p−1 with ǫ ∈ [0, 1− 21−p] and, by
Property 1, A′ = E′ · 2p−1 with E′ = ⌊(emin− p)/2⌋+ emax.
Thus, (26) is equivalent to E ≤ E′ − (ǫ+ 21−p). Since E,
E′ are integers and ǫ + 21−p ∈ (0, 1], the latter inequality
is equivalent to E ≤ E′ − 1. Now, emin = 1 − emax gives
E′− 1 = ⌊(emax− p− 1)/2⌋ and we conclude using the fact
that i ≤ ⌊j/2⌋ is equivalent, for integers i, j, to 2i ≤ j.
Proof of Property 9. We use the same notation as in the
proof of Property 8 and write O′ for the standard integer
encoding ofΩ′. The special integer encoding used for NaNs
gives Clarge ∨ Cnan = [|X| ≥ O′]. By Property 1 we have
O′ = (3emax +1)/2 · 2p−1, so that |X| ≥ O′ is equivalent to
E + ǫ ≥ (3emax + 1)/2. Since ǫ ∈ [0, 1), this is equivalent
to E ≥ (3emax + 1)/2 and the conclusion follows.
