Abstract-We consider the problem of computing IEEE floating-point squares by means of integer arithmetic. We show how to exploit the specific properties of squaring in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithms are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes. We show further that their C implementation for the binary32 format yields efficient codes for targets like the ST231 VLIW integer processor from STMicroelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.
I. INTRODUCTION
The STMicroelectronics ST231 is a 32-bit, 64-register, 4-issue, embedded integer VLIW processor that is mainly used for intensive media and signal processing. As this processor has integer-only register file and ALUs, the single precision floating-point support is available through software emulation, based on the highly optimized FLIP 1.0 library [1] . This comprehensive support is efficient enough to allow application developers to use out-of-the box floating-point code instead of converting it to fixed-point representations.
However, the implementation of floating-point operations on integer-only processors can lead to sub-optimal use of the computational resources: for example, all binary operators f (x, y) called with the same argument x = y lead to redundant unpacking of the floating-point format, and to useless checks for special cases, and fail to exploit some properties of f (x, x). Since the compiler can detect all static cases of such occurrences, this leads to the idea of specializing operators to gain additional performance.
In this paper we focus on the specialization of the multiplication operator f : (x, y) → x × y into a square operator x → x 2 . Squares of floating-point values are ubiquitous in scientific computing and signal processing, since they are intensively used in any algorithm requiring the computation of Euclidean norms, powers, sample variances, etc. [2] .
We give a thorough study of how to design efficient software for IEEE floating-point squaring on targets like the ST231, that is, by means of integer arithmetic and logic only, and with high ILP exposure. First, we show how to exploit the specific properties of squaring in order to refine the IEEE specification of multiplication, to deduce definitions of special input and generic input that are suitable for implementation, and to optimize the generic path and the special path for latency. This analysis is done for all rounding modes and presented in a parameterized fashion, in terms of the precision and the exponent range of the input/output floatingpoint format. Second, this analysis allows us to produce a complete portable C code for the binary32 format and each rounding mode. On ST231, the result is a latency of 12 cycles for rounding 'to nearest even,' which is 1.75x faster than the 21-cycle latency of the multiply operator of FLIP 1.0; for the other rounding modes, the speedups are even higher and range from 1.9 to 2.3. Also, the average number of instructions issued every cycle lies between 3.4 and 3.5, thus indicating heavy use of the 4 issues available.
For squaring in integer or fixed-point arithmetic several optimized hardware designs have been proposed [3] - [5] . However, for squaring in IEEE floating-point arithmetic much less seems to be available: for example, the software implementations presented in [6] , [7] do not mention that operator; furthermore, the implementation of squaring for the binary32 format and the ST231 processor outlined in [8] does not support subnormal numbers, is available only for rounding 'to nearest even,' and has a latency of 27 cycles. This paper is organized as follows. After some reminders on the ST231 processor and IEEE binary floating point in §II, we show in §III how to specialize to squaring the IEEE specification of multiplication, deduce suitable definitions of generic and special input, and give a highlevel algorithmic view of the square operator. § §IV and V then detail our algorithms for handling, respectively, generic and special input by means of integer arithmetic, and give the corresponding C implementation for the binary32 format. The performances of this implementation on ST231 are reported in §VI, and proofs of Properties 1 to 9 are available at http://hal.archives-ouvertes.fr/ensl-00532829/.
II. BACKGROUND

A. STMicroelectronics' ST231 processor
The targeted processor is the ST231 from STMicroelectronics. Since it is a scoreboarded VLIW, there is no dynamic instruction dispatch contrary to what could achieve an equivalent 4-way superscalar architecture: reaching the best performance relies heavily on compiler efficiency.
Instruction scheduling is performed after code selection, that must be aggressive to favor high ILP. Specifically, the if-conversion optimization that replaces conditional branches by conditional 'slct' instructions helps creating large straight-line code regions that are subject to efficient scheduling. The latency of this 'slct' instruction is 1 cycle.
Up to 4 independent instructions can be bundled into a syllable, achieving the maximum throughput of 4 instructions per cycle. One constraint is that the immediate value that can be encoded as part of an instruction is limited to 9-bit signed: a larger 32-bit constant, named extended immediate, can be built by consuming an additional instruction syllable in the bundle. This reduces the actual parallelism available locally but compares favorably with other mechanisms such as loading the constant from memory, that incurs a performance impact on the data cache side of the machine.
Only a very small subset of the ST231 instruction set is needed by our squaring algorithms: logical and bitwise operators (&&, ||, &, |), relational operators (<=, >=, >), bitwise shift operators (<<, >>), addition, subtraction, maximum, and minimum of (un)signed integers (+, -, max, maxu, minu), and finally a multiply operator mul giving the higher half AB/2 32 of the exact product of two 32-bit unsigned integers A and B. Except for mul whose latency is 3 cycles, the latency of each of these operators is 1 cycle.
B. IEEE 754-2008 binary floating point
Binary floating-point data. Besides NaNs, signed zeros, and signed infinities, the IEEE 754-2008 standard [9] specifies finite nonzero floating-point numbers as follows: given a precision p and an exponent range [e min , e max ],
where s is either 0 or 1, and where
Such numbers must in fact be either normal (m 0 = 1) or subnormal (m 0 = 0 and e = e min ). Thus, the smallest positive subnormal number is α = 2 emin−p+1 , while the largest normal number is Ω = (2 − 2 1−p ) · 2 emax . Assumptions on e max , e min , and p. We shall assume that e max = 2 w−1 − 1 for some positive integer w, (2a)
and we shall write k = w + p. All the binaryk formats of the IEEE 754-2008 standard satisfy these assumptions. This allows us to carry out all the analysis in a parameterized way and specialize later to a given format, for example the binary32 format: w = 8, e min = −126, e max = 127, p = 24. Standard encoding into k-bit integers. For the binaryk format the standard encoding of x in (1) is via a k-bit unsigned integer X whose bit string satisfies
with E = e − e min + m 0 . Zeros, infinities, and NaNs are encoded by special values of X: with |X| = X mod 2 k−1 ,
Correct rounding. Besides floating-point data and their encoding into integers, the standard [9, §4.3] defines rounding modes • to map any real number y to a unique floatingpoint datum x = •(y). In radix 2 four rounding modes are required: to nearest even (RN), down (RD), up (RU), to zero (RZ). In the case of the square operator we shall restrict to RN, RD, and RU, since RZ(y) = RD(y) when y ≥ 0.
Exceptions and flags. Finally, the standard defines five exceptional situations (invalid operation, division by zero, overflow, underflow, inexact) and requires that they shall be signaled by raising some status flags. For square, all exceptions but 'division by zero' can occur, just like for multiply. However, since the status flags are currently not set in the multiply operator of the FLIP library [1] to which we compare, we have not implemented them for square either.
III. SPECIFICATION AND HIGH-LEVEL ALGORITHM
As a special case of multiplication, squaring is specified by the IEEE 754-2008 standard: given a rounding mode • and with x = y, the result r prescribed by [9] 
Let min • and max • denote, respectively, the minimum value and maximum value of •(x 2 ) for |x| in [α, Ω]:
Let us also define the following two quantities:
and Ω = 2 (emax+1)/2 .
Property 1: The values α and Ω are normal floatingpoint numbers such that α < Ω .
Property 2: For • ∈ {RN, RD, RU} and x a finite nonzero floating-point number, •(x 2 ) = max • iff |x| ≥ Ω . Property 3: The value α is the largest integer power of two such that for every finite nonzero floating-point number
The main outcome of Properties 2 and 3 is the following specification of floating-point squaring, which refines (5):
This brings a natural distinction between two kinds of input:
Definition 1: Input x is called generic if α ≤ |x| < Ω , and special otherwise.
By Property 1 every subnormal input is special, so that generic input consist of normal numbers only. The corresponding output •(x 2 ) must be finite because of the "only if" part in Property 2, but it can be (sub)normal or zero.
Setting C spec = [x is special] (with [S] = 1 if S is true, 0 otherwise) allows us to translate (8) into a high-level algorithmic description based on 3 independent tasks T i :
Thanks to if-conversion, the generated assembly for the above algorithm will consist of a straight-line piece of code computing the result R i of each task T i and ending with a 'slct' instruction that selects R 2 if R 1 is true, R 3 otherwise. For the design and implementation of each task we proceed in 2 steps, as in [10] : assuming unbounded parallelism we optimize the a priori most expensive task first, namely task T 3 (see §IV), and then only T 1 and T 2 , by trying to reuse as much as possible what was computed for T 3 (see §V). The latency of 'slct' is of 1 cycle, so the lowest latency we can expect for squaring is 1 cycle more than that of T 3 .
IV. COMPUTING CORRECTLY-ROUNDED SQUARES
In this section we consider the computation of •(x 2 ) for x generic, that is, x as in (1) and such that
By Property 1 such an x is normal, and thus 1 ≤ m < 2.
A. Parameterized formula for
Normalized representation of the exact square.
−c and e = c+2e. Then 1 ≤ m < 2, and x 2 = m · 2 e is the so-called normalized representation of the exact square. Tight bounds for e are as follows.
Property 4: The normalized exponent e of x 2 satisfies 2 (e min − p)/2 ≤ e ≤ e max . Note that the lower bound is less than e min since p ≥ 2.
Correctly-rounded value of the exact square. When e ≥ e min we have x 2 ∈ [2 emin , 2 emax+1 ) and, since m ∈ [1, 2),
When e < e min the exact square ranges in (0, 2 emin ). In this case, we first set the exponent to e min and only then round the resulting scaled significand in fixed point:
with• the function that rounds the reals in [0, 2) in the same direction as • but on the grid {i · 2 1−p : i = 0, 1, . . . , 2 p }. To handle (10a) and (10b) simultaneously, let us define µ = max(c, e min − 2e).
Since• coincides with • on [1, 2), the correctly-rounded value of the exact square is given in both cases by
where
By construction one has 0 < < 2 and e min ≤ d ≤ e max . The property below further gives tight bounds for µ. Property 5: One has c ≤ µ ≤ p + with = p is odd . An explicit formula for•( ). Using ∨ for logical OR, the guard bit and sticky bit of are, respectively,
The correctly-rounded value of is then given bỹ
where the round bit b satisfies (see [4, pp. 436-437 
B. Implementation for the binaryk floating-point format
We show here how to implement the computation of r = •(x 2 ) using k-bit integer arithmetic and logic. We assume x given by its standard encoding into an unsigned k-bit integer X. Since r satisfies (12a), it is known (see e.g. [11, §2.3.1] ) that its standard integer encoding is R = D·2 p−1 + L, where
and
and b the rounding bit defined in (15), so that eventually
Thus, computing R amounts to deducing D, L, b from X, which we detail now for the binaryk format. For binary32 and • = RN the resulting C code is displayed in Listing 1. To get µ, recall first that x normal implies e = E − e max . Then, recalling that e min = 1 − e max and applying (11),
To get c, it suffices to note that
To get the (possibly negative) integer F , first we extract 2E from X in (3) by using the identity
and then we subtract 2E from the constant e max + 1. For the binary32 format the computation of c and F appears at line 3 of Listing 1, where (b504f 333) 16 is M 0 for k = 32. Let us now see how to deduce L from M and µ. The property below shows that the k most significant bits of the 2k-bit integer M 2 are enough for that purpose. Property 6: L = H/2 µ+w−1 with H = M 2 /2 k . Given M the mul instruction computes the higher half H of M 2 , which then, by Property 6, simply has to be shifted right by µ + w − 1 in order to yield L. For the binary32 format, this appears at lines 5 and 6 of Listing 1. Since p = 24 is even, Property 5 implies µ + 7 ≤ 31, which agrees with the C99 specification of the bitwise shift operator [12, p. 84] .
Computing b. We focus here on rounding to nearest even, for which b = g ∧ ( p−1 ∨ t) with g and t as in (13). Note first that g is the least significant bit of G = · 2 p = 0≤i≤p i 2 p−i and that, similarly to Property 6, we have g = G mod 2 and G = H/2 µ+w−2 . For binary32, the corresponding C code appears at lines 6 and 9 of Listing 1. Second, since p−1 is the least significant bit of L we have p−1 = L mod 2. Third, the sticky bit t shall be obtained without computing the lower half of the exact square With these codes the expected latency of R on ST231 drops from 10 to 7 (resp. 9) cycles for • = RD (resp. RU).
V. DETECTING AND HANDLING SPECIAL INPUT
We first have to decide whether input x is special or not, that is, to get from X the value of C spec = [x is special]. By Definition 1 we have C spec = C small ∨ C large ∨ C nan with C small = |x| < α , C large = |x| ≥ Ω , and C nan = x is NaN . The next two properties show how to obtain C small and C large ∨ C nan by reusing the value 2E computed for the generic case (see (20) and line 2 of Listing 1).
For the binary32 format, a C fragment implementing C spec by means of 2E and the two previous properties is shown at lines 2 to 4 of Listing 2. As C spec is independent of •, this code holds not only for • = RN but also for • ∈ {RD, RU}. Once special input have been filtered out, it remains to return, for the given rounding mode •, the standard integer encoding R of the associated result r prescribed by (8) :
When • is RN. We deduce from (6) and (8) that for x special, r must be +0 if |x| < α , qNaN if x is NaN, and +∞ otherwise. Implementing this is then straightforward as (4) implies that 0, 2 k−1 − 2 p−1 , and 2 k−1 − 2 p−2 are standard encodings of +0, +∞, and qNaNs, and that
For the binary32 format, the computation of C nan appears at lines 1 and 2 of Listing 2, while lines 6 to 9 display the computation of R.
When • is not RN. For • = RD the only difference with the previous case is when |x| ≥ Ω and by (6) and (8), we now have r = max(|x|, Ω). The standard integer encoding of Ω is 2 k−1 −2 p−1 −1, that is (7f 7f f f f f ) 16 for the binary32 format. Hence it suffices to replace line 9 of Listing 2 with: else return maxu(absX, 0x7f7fffff); For • = RU, the specification differs from that for • = RN only in the case where |x| < α , for which we have r = min(|x|, α). Since the standard integer encoding of α is 1, an implementation for the binary32 format follows by simply replacing line 6 in Listing 2 by if (Csmall) return minu(absX, 1);
VI. EXPERIMENTAL RESULTS OBTAINED ON ST231
The C codes detailed in § §IV and V yield a full implementation of squaring, for the binary32 format and each rounding mode. To check correctness we compiled them with gcc (using a C emulation of mul, max, maxu, and minu as in [11, Appendix B] with the ST231 latency constraints and assuming unbounded parallelism; these best latencies follow from our analysis in § §IV and V and have the form L + 1 with L the best latency for the generic case: the latencies achieved in practice are thus at most 1 cycle from the best possible ones. The values in parentheses are instruction numbers N and IPC values N/L. The IPC achieved is close to the highest ILP reachable with the architectural constraints of the machine. For comparison, the third column gives the latencies of the multiply operator of FLIP 1.0, optimized for the ST231 [1] .
As shown in the fourth column, our specialization of this multiply operator into a square operator yields a speedup between 1.75 and 2.3, depending on the rounding mode.
