Abstract: In this paper, we show how to reduce the computation of correctly rounded square roots of binary floating-point data to the fixed-point evaluation of some particular integer polynomials in two variables. By designing parallel and accurate evaluation schemes for such bivariate polynomials, we show further that this approach allows for high instruction-level parallelism (ILP) exposure, and thus potentially low-latency implementations. Then, as an illustration, we detail a C implementation of our method in the case of IEEE 754-2008 binary32 floating-point data (formerly called single precision in the 1985 version of the IEEE 754 standard). This software implementation, which assumes 32-bit unsigned integer arithmetic only, is almost complete in the sense that it supports special operands, subnormal numbers, and all rounding-direction attributes, but not exception handling (that is, status flags are not set). Finally, we have carried out experiments with this implementation on the ST231, an integer processor from STMicroelectronics' ST200 family, using the ST200 family VLIW compiler. The results obtained demonstrate the practical interest of our approach in that context: for all rounding-direction attributes, the generated assembly code is optimally scheduled and indeed has low latency (23 cycles).

1 INTRODUCTION
This paper deals with the design and software implementation of an efficient sqrt operator for computing square roots of binary32 floating-point data. As mandated by the IEEE 754 standard (whose initial 1985 version [1] was revised in 2008 [2]), our implementation supports gradual underflow and the four rounding-direction attributes required for binary format implementations. However, the status flags used for handling exceptions are not set.
As for other basic arithmetic operators, the IEEE 754 standard specifies that sqrt operates on and returns floating-point data. Floating-point data are either special data (signed infinities, signed zeros, not-a-numbers (NaN)) or nonzero finite floating-point numbers. In radix two, nonzero finite floating-point numbers have the form x = ±m · 2^e, with e an integer such that

  e_min ≤ e ≤ e_max,    (1)

and m a significand of the form

  m = (m_0.m_1 ... m_{p-1})_2.    (2)
Since m is nonzero, the number λ of its leading zeros varies in {0, 1, ..., p-1}. If |x| < 2^{e_min}, then x is subnormal; else it is normal. On one hand, subnormal numbers are such that e = e_min and λ > 0. On the other hand, normal numbers will be considered only through their normalized representation, that is, the unique representation of the form ±m · 2^e for which λ = 0.
The parameters e_min, e_max, and p used so far represent the extremal exponents and the precision of a given binary format. In this paper, we assume that they satisfy

  e_min = 1 - e_max and 2 ≤ p ≤ e_max.
This assumption is verified for all the binary formats defined in the IEEE 754-2008 standard [2]. This standard further prescribes that the operator sqrt: x ↦ r behaves as follows:
- If x equals either -0, +0, or +∞, then r equals x.
- If x is nonzero negative or NaN, then r is NaN.

These two cases cover what we shall call special operands. In all other cases, x is a positive nonzero finite floating-point number, that is,

  x = m · 2^e,    (3)
with e as in (1) and m as in (2); the result specified by the IEEE 754-2008 standard is then the so-called correctly rounded value

  r = ◦(√x),    (4)
where ◦ is any of the four rounding-direction attributes: to nearest even (RN), up (RU), down (RD), and to zero (RZ). In fact, since √x ≥ 0, rounding to zero is the same as rounding down. Therefore, considering only the first three rounding-direction attributes is enough:

  ◦ ∈ {RN, RD, RU}.
As we will see in Section 2, deducing r from x essentially amounts to taking the square root, up to scaling, of the significand m. For doing this, many algorithms are available (see, for example, the survey [3] and the reference books [4] , [5] , [6] ).
The method we introduce in this paper is based exclusively on fixed-point evaluation of a suitable bivariate polynomial with integer coefficients. Since polynomial evaluation is intrinsically parallel, this approach allows for very high ILP exposure. Thus, in some contexts such as VLIW integer processors, a significant reduction of latency can be expected. For the ST231, an STMicroelectronics VLIW processor which does not have native floating-point capabilities, we show in this paper that this is indeed the case: the assembly codes generated for optimized implementations of our approach turn out to be optimally scheduled and have latencies reduced by over 30 percent compared to the previously fastest available methods, implemented in [7]; also, the latency overhead for subnormal support is only two cycles, yielding a latency of 23 cycles for all ◦. This latency for square root compares favorably with the latencies of addition and multiplication, which, in the same context, range, respectively, from 23 to 27 and from 18 to 21 cycles, depending on ◦ (see http://flip.gforge.inria.fr/ and [8, p. 4]).
The paper is organized as follows: Section 2.1 details three mathematical properties of square roots of binary floating-point numbers and Section 2.2 shows how to use them to deduce the usual high-level algorithmic description of the sqrt operator.
In Section 3.1, we then show how to introduce suitable bivariate polynomials that approximate our square root function. In particular, we give some approximation and evaluation error bounds that are sufficient to ensure correct rounding, along with an example of such a polynomial and its error bounds in the case of the binary32 format. Section 3.2 then details, for each rounding-direction attribute, how to deduce a correctly rounded value from the approximate polynomial value obtained so far. A summary of our new approach is given in Section 3.3.
A standard C99 implementation of this approach is given in Section 4, for the binary32 format and assuming that 32-bit unsigned integer arithmetic is available: Section 4.1 shows how to handle special operands. Section 4.2 deals with the computation of the result exponent and a parity bit related to the input exponent (which is needed several times in the rest of the algorithm). Sections 4.3 and 4.4 show how to compute the evaluation point and the value of the polynomial at this evaluation point; there, we also explain how the accuracy of the evaluation scheme has been verified. Finally, Section 4.5 details how to implement correct rounding. This results in three separate square root implementations, one for each rounding-direction attribute.
For these implementations, all that is needed is a C compiler that implements 32-bit arithmetic. However, our design has been guided by a specific target, the ST231 VLIW integer processor from STMicroelectronics' ST200 family.
Section 5 describes some experiments carried out with this target and the ST200 family VLIW compiler. After a review of the main features of the ST231 in Section 5.1, the performance results we have obtained in this context are presented and analyzed in Section 5.2.
2 PROPERTIES OF FLOATING-POINT SQUARE ROOTS AND GENERAL ALGORITHM
The general scheme for moving from (3) to (4) essentially follows from three properties of the square root function in binary floating-point arithmetic.
2.1 Floating-Point Square Root Properties
Property 2.1. For x in (3), the real number √x lies in the range of positive normal floating-point numbers, that is,

  2^{e_min} ≤ √x ≤ (2 - 2^{1-p}) · 2^{e_max}.
This first property (see [9] for a proof) implies that ◦(√x) is a positive normal floating-point number. Correctly rounded square roots thus never denormalize nor under/overflow, a fact which will simplify the implementation considerably.
In order to find the normalized representation of ◦(√x), define

  m' = m · 2^λ and e' = e - λ.    (5)

It follows that the positive (sub)normal number x defined in (3) satisfies

  x = m' · 2^{e'},

and that m' ∈ [1, 2). Taking the square root then yields

  √x = ℓ · 2^d,

where the real ℓ and the integer d are given by

  ℓ = σ · √(m'), with σ = √2 if e' is odd and σ = 1 otherwise,    (6)

and, using ⌊·⌋ to denote the usual floor function,

  d = ⌊e'/2⌋.    (7)
It follows from the definition of σ and m' that ℓ is a real number in [1, 2). Therefore, rounding √x amounts to rounding ℓ, and we have shown the following property:

Property 2.2. Let x, ℓ, and d be as above. Then

  ◦(√x) = ◦(ℓ) · 2^d.
In general, the fact that ℓ ∈ [1, 2) only implies the weaker enclosure ◦(ℓ) ∈ [1, 2]. This yields two separate cases: if ◦(ℓ) < 2, then Property 2.2 already gives the normalized representation of the result r; if ◦(ℓ) = 2, then we must further correct d after rounding, in order to return r = 1 · 2^{d+1} instead of the unnormalized representation 2 · 2^d given by Property 2.2. The next property, proved in [9], characterizes precisely when such a correction is needed. In particular, it is never needed in rounding-to-nearest mode.

Property 2.3. One has ◦(ℓ) = 2 if and only if ◦ = RU, m = 2 - 2^{1-p}, and e is odd.
2.2 High-Level Description of Square Root Algorithms
Together with the IEEE 754-2008 specification recalled in Section 1, the properties of Section 2.1 lead to a general algorithm for computing binary floating-point square roots, shown in Figs. 1 and 2 for the different rounding-direction attributes. In particular, ◦(ℓ) and d are two functions of m and e, which can be computed independently of each other.
(We will see in Section 4.2 that the correction step in Fig. 2 can in fact be avoided when using the standard encoding of binary floating-point data.) Given m and e, computing d is algorithmically easy since, by (5) and (7), we have

  d = ⌊(e - λ)/2⌋.    (8)
However, computing ◦(ℓ) from m and e is far less immediate and typically dominates the cost. In the next section, we present a new way of producing ◦(ℓ), which we have chosen because we believe that it exposes the most ILP.
3 COMPUTING ◦(ℓ) BY CORRECTING TRUNCATED APPROXIMATIONS
It is useful to start by characterizing the meaning of ◦(ℓ) for each rounding-direction attribute ◦. Since the real number ℓ belongs to [1, 2), we have the following, where n denotes a number of the form (1.n_1 ... n_{p-1})_2:
- RN(ℓ) is the unique n such that

  n - 2^{-p} < ℓ < n + 2^{-p};    (9)
- RD(ℓ) is the unique n such that

  n ≤ ℓ < n + 2^{1-p};    (10)
- RU(ℓ) is the unique n such that

  n - 2^{1-p} < ℓ ≤ n.    (11)
Both inequalities in (9) are strict, since the square root of a floating-point number cannot be exactly halfway between two consecutive floating-point numbers (see [4, Theorem 9.4], [5, p. 242], or [6, p. 463]). Note also that (9) and (10) both imply n ∈ [1, 2), while (11) implies the weaker enclosure n ∈ [1, 2]. Given p, ◦, m', and σ, there are many ways of producing the bits of such an n. A way that allows much ILP to be exposed is to correct a truncated approximation of ℓ. This approach (detailed, for example, in [6, p. 459] for division) has the following three main steps:
- Compute a "real number" v approximating ℓ from above with absolute error less than 2^{-p}, that is,

  ℓ ≤ v < ℓ + 2^{-p}.    (12)
- Deduce u by truncating v after p fraction bits:

  u = ⌊v · 2^p⌋ · 2^{-p}.    (13)
- Obtain n by adding, if necessary, a small correction to u and then truncating after p - 1 fraction bits.

Of course, the binary expansion of v in (12) will be finite: by "real number," we simply mean a number with precision higher than the target precision p. Typically, v will be representable with at most k bits, with k the register width (for example, p = 24 and k = 32 for our implementation in Section 4). On the other hand, using ℓ ≤ √(2(2 - 2^{1-p})) < 2 - 2^{-p}, we deduce from (12) that

  1 ≤ v < 2.    (14)
Therefore, the binary expansion of v has the form

  v = (1.v_1 v_2 ... v_{k-2})_2.    (15)
Our approach for computing v as in (12) and (15) by means of bivariate polynomial evaluation is detailed in Section 3.1 below.
Once v is known, truncation after p fraction bits gives

  u = (1.v_1 ... v_p)_2.    (16)
The fraction of u is wider than that of n by one bit. We will see in Section 3.2 how to correct u into n by using both this extra bit and the fact that, because of (12) and (13),

  |u - ℓ| < 2^{-p}.    (17)
3.1 Computing v by Bivariate Polynomial Evaluation
The problem is, given p, m', and σ, to compute an approximation v to ℓ such that (12) holds. Such a value v will in fact be obtained as a solution to

  |v - (ℓ + 2^{-p-1})| < 2^{-p-1}.    (18)
Although slightly stricter than (12), this form will be the natural outcome of some derivations based on the triangle inequality.
Computing v such that (18) holds usually relies on iterative refinement (Newton-Raphson or Goldschmidt method [6, Section 7.3]), or univariate polynomial evaluation [7] (see also [10] for a different context, namely when an FMA instruction is available), or a combination of both [11] , [12] . The approach we shall detail now is based exclusively on polynomial evaluation. However, instead of using two polynomials as in [7] , we will use a single one, which is bivariate; this makes the approach simpler, more flexible, and eventually, faster provided some parallelism is available.
The main steps for producing v via bivariate polynomial evaluation are as follows.
First, using (6) and defining

  τ = m' - 1,    (19)

the real number ℓ + 2^{-p-1} can be seen as the value at (s, t) = (σ, τ) of the bivariate function

  F(s, t) = 2^{-p-1} + s · √(1 + t).    (20)
Then, let S = {1, √2} and let T (a finite set of multiples of 2^{1-p}) be the variation domains of σ and τ, respectively.
Since T ⊆ [0, 1 - 2^{1-p}], a second step is the approximation of F(s, t) on {1, √2} × [0, 1 - 2^{1-p}] by a bivariate polynomial P(s, t).
The function F being linear with respect to its first variable, a natural choice for P is

  P(s, t) = 2^{-p-1} + s · a(t),    (21)
with a(t) a univariate polynomial that approximates √(1 + t) on [0, 1 - 2^{1-p}]. The third and last step is the evaluation of P at (σ, τ) by a finite-precision, straight-line program 𝒫, the resulting value 𝒫(σ, τ) being assigned to v.
Intuitively, if a(t) is "close enough" to √(1 + t) over the whole interval [0, 1 - 2^{1-p}] and if 𝒫 evaluates P "accurately enough" on the whole domain S × T, then the resulting value v should be close enough to ℓ + 2^{-p-1} in the sense of (18). This claim is made precise by the next lemma.
Lemma 1. Given p, σ, τ, a, P, and 𝒫 as above, let α(a) be the approximation error defined by

  α(a) = max_{t ∈ [0, 1-2^{1-p}]} |√(1 + t) - a(t)|,
and let ρ(𝒫) be the rounding error defined by

  ρ(𝒫) = max_{(s,t) ∈ S×T} |P(s, t) - 𝒫(s, t)|.
If

  √2 · α(a) + ρ(𝒫) < 2^{-p-1},    (22)

then v satisfies (18).
Proof. Using the definitions of F and P, we have

  |v - (ℓ + 2^{-p-1})| = |𝒫(σ, τ) - F(σ, τ)| ≤ |P(σ, τ) - F(σ, τ)| + |𝒫(σ, τ) - P(σ, τ)| ≤ σ · α(a) + ρ(𝒫).
Since 1 ≤ σ ≤ √2 and τ ∈ [0, 1 - 2^{1-p}], it follows that

  |v - (ℓ + 2^{-p-1})| ≤ √2 · α(a) + ρ(𝒫) < 2^{-p-1},
which concludes the proof. □
It remains to find a polynomial approximant a together with an evaluation program 𝒫 such that α(a) and ρ(𝒫) satisfy (22). In practice, since α(a) and ρ(𝒫) may be arbitrary real numbers, a certificate will consist in computing two dyadic numbers α_d and ρ_d such that α(a) ≤ α_d, ρ(𝒫) ≤ ρ_d, and √2 · α_d + ρ_d < 2^{-p-1}.
The construction of a and 𝒫 is highly context dependent: it is guided by both the value of p and some features of the target processor (register width k, instruction set, latencies, degree of parallelism, ...). The two paragraphs below illustrate how to choose a and 𝒫 in the case where k = 32 and p = 24.
Constructing a Polynomial Approximant
Since P in (21) will be evaluated at runtime, a small degree for a is usually preferred. One may guess the smallest possible value of deg(a) as follows: the rounding error ρ(𝒫) in (22) being nonnegative, a must satisfy

  α(a) < 2^{-p-3/2}.    (23)
Now let ℝ[t]_d be the set of univariate real polynomials of degree at most d and recall, for example, from [13, Section 3.2], that the minimax degree-d approximation of √(1 + t) on [0, 1 - 2^{1-p}] is the unique polynomial a* in ℝ[t]_d such that

  α(a*) = min_{a ∈ ℝ[t]_d} α(a) =: E(d).    (24)

Thus, by combining (23) and (24), one must have

  E(deg(a)) < 2^{-p-3/2}.

A lower bound on deg(a) can then be guessed by estimating E(d) for d = 0, 1, 2, ....^1 For p = 24, Table 1 indicates that a should be of degree at least eight.
1. Such estimations can be computed, for example, using Remez' algorithm, which is available in Sollya (http://sollya.gforge.inria.fr/); see also [13, Section 3.5], [14], [15].

Once we have an idea of the smallest possible degree for a, it remains to find a machine representation of a. For this representation, a typical choice is the monomial basis 1, t, t^2, ... together with coefficients a_0, a_1, a_2, ... that are exactly representable using at most k bits.
Continuing the previous example where k = 32, it turns out that a(t) = Σ_{i=0}^{8} a_i t^i can be chosen such that each coefficient has the form a_i = ±A_i · 2^{-c_i}, with A_i a 32-bit unsigned integer and c_i a suitable scaling exponent. (See Listing 3 for their hexadecimal values.) These integers A_i were found by truncating the coefficients obtained after several calls to Sollya's Remez algorithm and by favoring powers of two. Each of them can be stored using only 32 bits, and four of them are powers of two.
A certified supremum norm computation (implemented, for example, in Sollya; see also [16], [17]) applied to this particular polynomial gives a bound less than 2^{-25.5}, as required by (23). (The computed bound has the form 2^{-25.972...}.)
Writing an Evaluation Program
In our context, the evaluation program 𝒫 is typically a piece of C code that implements a finite-precision evaluation of P(s, t). It should be accurate enough in the sense of (22) and, since we favor latency (rather than throughput, for example), as fast as possible.
Such an implementation will not require floating-point arithmetic: fixed-point arithmetic suffices to evaluate P(s, t) accurately enough. In addition, we have the lemma below, which shows that P(s, t) lies in a fairly small range.

Lemma 2. If (s, t) ∈ S × T, then P(s, t) ∈ (1, 2).
Proof. For (s, t) ∈ S × T, the definition of F gives F(s, t) - 2^{-p-1} = s · √(1 + t) = ℓ ∈ [1, 2 - 2^{-p}). It follows from the definition of α(a) and the bound in (23) that |P(s, t) - F(s, t)| ≤ √2 · α(a) < √2 · 2^{-p-3/2} = 2^{-p-1}, and hence P(s, t) ∈ (1, 2). □

Rounding the evaluation point. When designing an evaluation program 𝒫 that achieves this accuracy, a preliminary step is to make the input (σ, τ) machine-representable. On one hand, the binary expansion of τ is 0.m_{λ+1} ... m_{p-1}, and thus, since τ is nonnegative, τ is already representable using k bits provided p - 1 ≤ k. On the other hand, writing RN_k for rounding to nearest in precision k, we shall replace σ defined in (6) by

  σ̂ = RN_k(σ),    (25)

that is, σ̂ = 1 if e' is even and σ̂ = RN_k(√2) otherwise.
The lemma below gives an upper bound on the loss of accuracy that occurs when rounding the evaluation point (σ, τ) to (σ̂, τ); combined with Lemma 1, it leads to the sufficient condition (29) on the rounding error of 𝒫 that is checked in Section 4.4.
Designing an evaluation program. By evaluation program, we mean a division-free straight-line program, that is, roughly, a set of instructions computing P(s, t) from s, t, p, and the a_i's by using only additions, subtractions, and multiplications, without branching. In our context, we first assume unbounded parallelism and some latencies for addition/subtraction and multiplication. Then we parenthesize the expression P(s, t) in order to expose as much ILP as we can and, thus, to decrease the overall latency of the polynomial evaluation. The parenthesization we use, denoted (30) in what follows, was generated automatically in the same spirit as [18]; it is implemented in Listing 3 and sketched below.
Note that t^2 and s · t^2 are common subexpressions. With unlimited parallelism and latencies of one cycle for addition, three cycles for (pipelined) multiplication, and one cycle for multiplication by a power of two (that is, a shift), this parenthesization gives a latency of 13 cycles (compared to 34 for Horner's scheme). We have found no parenthesization of smaller latency.
Accuracy issues. In our example, the rounding error of the program 𝒫 that implements the above parenthesization in fixed-point arithmetic must satisfy (29). This requirement will be checked in Section 4.4.
For now, let us note that this parenthesization can in fact be implemented using 32-bit unsigned integers only, which avoids losing one bit of precision to the storage of the sign of the coefficients a_i. Indeed, an appropriate choice of arithmetic operators can be found which ensures that all intermediate variables are nonnegative:
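To make this concrete, here is a minimal C99 sketch of such an all-unsigned evaluation. It assumes the encodings S = σ̂ · 2^30 (Q2.30) and T = τ · 2^32 (Q0.32); T's encoding is that of Section 4.3, while the encoding of S is an assumption of this sketch. For illustration only, it uses the truncated Taylor coefficients of √(1 + t) in place of the certified minimax coefficients of Listing 3, whose values, scalings, and grouping differ; in particular, this grouping schedules in about 14 cycles under the latencies above, one more than the 13 cycles of (30), and its accuracy is not the certified one. Only the structure (all-unsigned fixed point, positive brackets, shifts for power-of-two coefficients, balanced parentheses) is the point.

    #include <stdint.h>

    /* High 32 bits of the 32 x 32 -> 64-bit product: floor(a*b / 2^32). */
    static inline uint32_t mul(uint32_t a, uint32_t b) {
        return (uint32_t)(((uint64_t)a * b) >> 32);
    }

    /* Sketch: v ~ P(s,t) = 2^-25 + s*a(t), deg(a) = 8, with
         a(t) ~ 1 + t/2 - t^2/8 + t^3/16 - 5t^4/128 + 7t^5/256
                - 21t^6/1024 + 33t^7/2048 - 429t^8/32768   (Taylor, illustrative),
       grouped in pairs so that every bracket is positive on [0,1).
       Input: S = sigma*2^30 (Q2.30, assumed), T = tau*2^32 (Q0.32).
       Output: v*2^30 (Q2.30). */
    static uint32_t eval_v(uint32_t S, uint32_t T) {
        uint32_t t2  = mul(T, T);                          /* t^2    (Q0.32) */
        uint32_t t4  = mul(t2, t2);                        /* t^4    (Q0.32) */
        uint32_t st2 = mul(S, t2);                         /* s*t^2  (Q2.30) */
        uint32_t st4 = mul(st2, t2);                       /* s*t^4  (Q2.30) */
        /* r0 = 2^-25 + s*(1 + t/2); 2^-25 is 32 in Q2.30, t/2 is T >> 3. */
        uint32_t c01 = 0x40000000u + (T >> 3);             /* 1 + t/2 (Q2.30) */
        uint32_t r0  = 32u + mul(S << 1, c01 << 1);        /* Q2.30 */
        /* r1 = s*t^2 * (1/8 - t/16): positive bracket, subtracted below. */
        uint32_t c23 = 0x20000000u - (T >> 4);             /* Q0.32 */
        uint32_t r1  = mul(st2, c23);                      /* Q2.30 */
        /* r2 = s*t^4 * [(5/128 - 7t/256) + t^2*(21/1024 - 33t/2048)
                         + 429*t^4/32768]: all brackets positive.       */
        uint32_t c45 = 0x0A000000u - mul(T, 0x07000000u);  /* Q0.32 */
        uint32_t c67 = 0x05400000u - mul(T, 0x04200000u);  /* Q0.32 */
        uint32_t u1  = mul(t2, c67);                       /* Q0.32 */
        uint32_t u2  = mul(t4, 0x035A0000u);               /* 429*t^4/32768 */
        uint32_t r2  = mul(st4, (c45 + u1) + u2);          /* Q2.30 */
        return r0 - r1 - r2;                               /* v in Q2.30 */
    }

As in (30), t^2 and s · t^2 are common subexpressions, power-of-two coefficients are handled by shifts, and the critical path is dominated by the chain of dependent multiplications T → t2 → st2 → st4 → r2.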
3.2 Correction to Ensure Correct Rounding
For each rounding-direction attribute ◦, we will now obtain n = ◦(ℓ) by correcting u in (16) and (17). How to correct u depends on whether u is above or below ℓ. Thus, the rounding algorithms below rely on either u ≥ ℓ or u > ℓ, which can both be implemented exactly (see Section 4.5).
3.2.1 Rounding to Nearest
An algorithm producing n as in (9) is: if v_p = 0, return n = u; otherwise, return n = u - 2^{-p} if u ≥ ℓ, and n = u + 2^{-p} if u < ℓ.
If v_p = 0, the above algorithm always returns the value u; this is the desired result, for in this case u is already a floating-point number, and thus (17) implies (9). If v_p = 1, the algorithm returns either u - 2^{-p} or u + 2^{-p}: the former is returned when u > ℓ, the latter when u < ℓ, and (17) implies (9) in both cases; the case u = ℓ never happens because ℓ cannot be a midpoint.
3.2.2 Rounding Down
An algorithm producing n as in (10) is: return n obtained by truncating u - [u > ℓ] · 2^{-p} after p - 1 fraction bits. If v_p = 1, this algorithm always produces the floating-point number u - 2^{-p}, which, because of (17), satisfies (10) as required. If v_p = 0, then u is already a floating-point number: if u > ℓ, then u ≥ 1 + 2^{1-p} and the algorithm returns u - 2^{1-p}, which is the floating-point predecessor of u, obtained by truncating the midpoint u - 2^{-p}; if u ≤ ℓ, the returned value is u; in both cases, using (17) yields (10).
3.2.3 Rounding Up
An algorithm producing n as in (11) is: return n obtained by truncating u after p - 1 fraction bits and then adding 2^{1-p} if v_p = 1 or u < ℓ.
If v_p = 1, this algorithm always produces the floating-point number u + 2^{-p}, which, because of (17), satisfies (11) as required. If v_p = 0, then u is already a floating-point number: if u ≥ ℓ, the algorithm returns u; if u < ℓ, it returns u + 2^{1-p}, which is the floating-point successor of u since u ≤ 2 - 2^{1-p}; in both cases, using (17) yields (11).
3.3 Summary: Main Steps of the Computation of ◦(ℓ)
The box "Compute ◦(ℓ)" in Figs. 1 and 2 can be replaced by the ones in Fig. 3 below. This figure recapitulates the main steps of the approach we have proposed in Sections 3.1 and 3.2 for deducing the correctly rounded value ◦(ℓ) from m and e.
4 IMPLEMENTATION DETAILS
The above square root method, which is summarized in Figs. 1, 2, and 3, can be implemented by operating exclusively on (unsigned) integers. We will now detail such an implementation, in C and for the binary32 format of [2]. For this format, the storage bit width, precision, and maximum exponent are, respectively,

  k = 32, p = 24, e_max = 127.
When writing the code, we have essentially assumed unbounded parallelism and that 32-bit integer arithmetic is available. Additional assumptions on the way input and output are encoded, and on which operators are available, are as follows:
Input and output encoding. The operand x and result r of sqrt are implemented as integers X, R ∈ {0, 1, ..., 2^32 - 1} whose bit strings correspond to the standard encoding of binary floating-point data (see [2, Section 3.4]). Our implementation of sqrt is thus a C function as in line 1 of Fig. 4 (Listing 2), which returns the value of R. Table 2 indicates some useful relationships between x and X that are a consequence of this encoding. (Of course, the same would hold for r and R.) Also, the bit string of X must be interpreted as follows: its first bit X_31 gives the sign of x; the next 8 bits encode the biased exponent E of x as E = Σ_{i=0}^{7} X_{i+23} 2^i; the last 23 bits define the trailing significand field. In particular, if x is (sub)normal, then
- the biased exponent E is related to the exponent e in (1) by

  e = E - 127 + [λ > 0],    (31)

  where [λ > 0] = 1 if x is subnormal and 0 otherwise;

- the trailing significand field carries the bits of m in (2) as follows:

  m_i = X_{23-i} for i = 1, ..., 23.    (32)

Available operators. In addition to the usual arithmetic and logical operations on 32-bit unsigned integers A and B, we assume the availability of the following three functions:

- max(A, B),
- the number nlz(A) of leading (that is, leftmost) zeros of the bit string of A, and
- ⌊AB/2^32⌋, whose bit string contains the 32 most significant bits of the product AB.

In our C codes, the operators corresponding to these functions will be, respectively, written max, nlz, and mul for readability. More precisely, with the ST231 target in mind, we shall assume that the latencies of max and nlz are of one cycle, while the latency of mul is of three cycles. Furthermore, we shall assume that at most two instructions of type mul can be launched simultaneously at every cycle. (How to implement max, nlz, and mul in C is detailed in the Appendix.)
From Sections 4.2 to 4.5, the operand x will be a positive (sub)normal number. In this case, X_31 = 0 and, since the result r is a positive normal number, R_31 = 0 as well. Therefore, it suffices to determine the 8 bits R_30, ..., R_23, which give the biased result exponent D = Σ_{i=0}^{7} R_{i+23} 2^i, and the 23 bits R_22, ..., R_0, which define the trailing significand field of the result.
4.1 Handling Special Operands
For square root, the floating-point operand x is considered special when it is ±0, +∞, less than zero, or NaN. With the standard encoding, this happens exactly when X = 0 or X ≥ 0x7F800000, that is, when

  X - 1 mod 2^32 ≥ 0x7F7FFFFF.    (33)
All special operands can thus be detected using (33). The results required by the IEEE 754-2008 standard for the square root of such operands are listed in Table 3 .
Note that there are essentially only two cases to consider: r is either x or qNaN. The first case occurs when x ∈ {+0, +∞, -0}; when x is already known to be special, this is equivalent to requiring that X ≤ 2^31 and that the trailing significand field of X be zero. In the second case, the returned quiet NaN is encoded by

  R = X | C,    (34)

with C the constant whose bit string consists of one zero followed by nine ones followed by 22 zeros (that is, C = 0x7FC00000). Note that the quiet NaN thus produced keeps as much of the information of X as possible, as recommended in [2, Section 6.2]; in particular, the payload is preserved when quieting an sNaN, and if x is a qNaN, then x is returned. Using the fact that the addition of two 32-bit unsigned integers is done modulo 2^32 and taking the hexadecimal values of the constants in (33) and (34), we finally get the C code shown in Listing 1 for handling special operands. In this code, note that the four operations +, <=, ==, and | on X are independent of each other. Listing 1. Code for handling special operands.
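A possible reconstruction of Listing 1, consistent with (33), (34), and Table 3, is sketched below; the paper's exact listing may differ in form:

    uint32_t Xm1 = X + 0xFFFFFFFFu;          /* X - 1 modulo 2^32, cf. (33) */
    if (Xm1 >= 0x7F7FFFFFu) {                /* x is +-0, +inf, < 0, or NaN */
        if (X <= 0x80000000u && (X & 0x007FFFFFu) == 0u)
            return X;                        /* x in {+0, -0, +inf}: r = x  */
        return X | 0x7FC00000u;              /* r = qNaN, payload kept (34) */
    }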
Remark that (33) and (34) extend immediately to other standard formats like binaryk with k = 16, 64, 128, .... Thus, if integer arithmetic modulo 2^k is available and if the standard encoding is used for binaryk data, special values can be handled in the same way as shown here.
4.2 Computing the Biased Value of d and the Parity of e'
By Property 2.1, the result r cannot be subnormal. Therefore, by applying (31) to r, we deduce that the biased value D of the exponent d of r is always given by

  D = d + 127.    (35)
In order to compute D from X, we first use the expression of d in (8) and the relation (31) to obtain

  D = ⌊(E - λ + [λ > 0] + 127)/2⌋.
Then, using (32) and the second and third rows of Table 2, we deduce that the number of leading zeros of X is λ + 8 when λ > 0, and at most 8 when λ = 0. Hence,

  M := max(nlz(X), 8) = λ + 8.    (36)
An immediate consequence of this is that [λ > 0] = [M > 8]. However, more instruction-level parallelism can be obtained by observing in Table 2 that

  [λ > 0] = [X < 2^23].    (37)
Formula (35) for the biased exponent D thus becomes

  D = ⌊(B + 135)/2⌋, where B := E - M + [X < 2^23].    (38)

Remark that in rounding-up mode, Property 2.3 requires that the integer D obtained so far be further incremented by one when

  m = 2 - 2^{-23} and e is odd.    (39)
(This correction has also been illustrated in Fig. 2.) Since (39) implies that x is a normal number, D must be replaced by D + 1 if and only if X ≥ 2^23 and the last 24 bits of X are one zero followed by 23 ones. An implementation of this update is thus as follows:
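A sketch of this update (the exact spelling in the paper's code may differ):

    /* Increment D exactly when (39) holds: X >= 2^23 and the last 24 bits
       of X are one zero followed by 23 ones.                              */
    D = D + ((X >= 0x00800000u) & ((X & 0x00FFFFFFu) == 0x007FFFFFu));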
In fact, this treatment specific to rounding up can be avoided by exploiting the standard encoding as follows. Recall that n = ◦(ℓ) is in [1, 2] and has at most 23 fraction bits, and that we want the bit string 0 R_30 ... R_0 of R to encode r = n · 2^d. If R is obtained by adding the integer n · 2^23 ∈ [2^23, 2^24] to (D - 1) · 2^23, then the carry propagates into the exponent field exactly when n = 2, which realizes the required correction for free. Hence, we implemented the computation of D - 1 using the formula below, which directly follows from (38):

  D - 1 = ⌊(B + 133)/2⌋.    (40)
This corresponds to the computation of variable Dm1 at line 7 of Fig. 4 (Listing 2). The only difference with the implementation of D given right after (38) occurs at line 6, where we perform B + 133 instead of B + 135.
Let us now turn to the parity of e' in (5), which will be needed in Sections 4.3 and 4.5. Using (31), (36), and (37), we deduce that e' is even if and only if the last bit of B = E - M + [X < 2^23] is equal to 1. Since this expression already appears in (38) and (40), an implementation follows immediately; an alternative, which uses only logical operators and unsigned integers, consists in taking the XOR of the last bit of E + [X < 2^23] and of the last bit of M:
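Both variants are sketched below (hypothetical spelling; wraparound in B is harmless since only the last bit matters):

    uint32_t E = X >> 23;                     /* biased exponent (X_31 = 0) */
    uint32_t B = E - M + (X < 0x00800000u);   /* as in (38), modulo 2^32    */
    uint32_t even = B & 1u;                   /* e' is even iff B is odd    */
    /* Alternative with logical operators and unsigned integers only:
       even = ((E + (X < 0x00800000u)) ^ M) & 1u;                           */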
4.3 Computing the Evaluation Point

The two coordinates of the rounded evaluation point (σ̂, τ) of Section 3.1 are encoded by two 32-bit unsigned integers S and T. For σ̂ as in (25), and given the value of the integer even (see Section 4.2), computing S can be done as shown at line 8 in Fig. 4 (Listing 2).
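A sketch, under the assumption that S encodes σ̂ in Q2.30 format (the actual encoding used in Listing 2 may differ):

    /* sigma-hat = 1 if e' is even, RN_32(sqrt(2)) otherwise; in Q2.30,
       RN_32(sqrt(2) * 2^30) = 0x5A82799A.                                 */
    uint32_t S = even ? 0x40000000u : 0x5A82799Au;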
The number τ in (19) satisfies τ = (0.m_{λ+1} ... m_23)_2. Therefore, it can be viewed as a Q0.32 number:

  τ = T · 2^{-32},    (42)

where T = Σ_{i=0}^{31} T_i 2^i is the integer in [0, 2^32) such that the bit string of T is

  [m_{λ+1} ... m_23 0 ... 0].    (43)
By (32) and (36), we see that T can be computed by shifting X left by M + 1 positions. Since M is not immediately available, more ILP can be exposed by implementing this shift as in line 7 of Fig. 4 (Listing 2), shown below. Also, for the shift by M = max(nlz(X), 8) to be well defined here, the C standard [19] requires that 0 ≤ M < 32. One may check that this is indeed the case: since, by assumption, x is nonzero, the number of leading zeros of X is less than 32, and thus, 8 ≤ M < 32.
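In code (this statement is the one quoted in Section 5.2):

    T = (X << 1) << M;    /* shift X left by M + 1 = lambda + 9 positions */

Splitting the shift in two lets X << 1 be computed in parallel with M = max(nlz(X), 8), which shortens the critical path by one cycle compared to X << (M + 1).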
4.4 Computing the Approximate Polynomial Value v
Listing 3 below shows an implementation of the evaluation scheme (30) using 32-bit unsigned integers and the identities in (25). Multiplications by coefficients that are powers of two (like A_1, A_2, A_8) are implemented as simple shifts. Assuming a latency of one cycle for additions, subtractions, and shifts, a latency of three cycles for (pipelined) multiplications, and that at most two multiplications can be started simultaneously, this code can be scheduled in at most 13 cycles, as shown in Table 4. There, the dashes "-" indicate that an instruction requires an additional slot because it uses an extended immediate; see Section 5.1.
The numerical quality of the code in Listing 3 has been verified using the Gappa software (see http://gappa.gforge.inria.fr/ and [20], [21]; see also [22, Section 4] for some guidelines on how to translate C code into Gappa syntax). With this software, we first checked that all variables r_0, ..., r_21 are indeed integers in the range [0, 2^32). Then we used Gappa to compute a certified upper bound on the final rounding error; the bound produced is less than 2^{-27.93}, and thus less than the sufficient bound in (29). The Gappa script we wrote to perform this certification step is contained in the file binary32sqrt.gappa available at http://prunel.ccsd.cnrs.fr/ensl-00335792.
4.5 Implementing the Rounding Tests
The only nontrivial part here is to evaluate the conditions u ≥ ℓ (used when rounding to nearest and rounding up; see Sections 3.2.1 and 3.2.3) and u > ℓ (used when rounding down; see Section 3.2.2). It turns out that such comparisons can be implemented exactly by introducing three integers P, Q, and Q' that we shall define below by considering u^2 and ℓ^2 instead of u and ℓ.
Listing 3. Bivariate polynomial evaluation code. 
First, note that

  u = U · 2^{-30},    (44)

with U the integer in [0, 2^32) whose bit string is [0 1 v_1 ... v_24 0 0 0 0 0 0].
Let P be the integer in [0, 2^32) such that

  P = mul(U, U).
It follows from (44) and the definition of mul that

  u^2 - 2^{-28} < P · 2^{-28} ≤ u^2.    (45)
With c := e' mod 2, we have ℓ^2 = (1 + τ) · 2^c, so that ℓ^2 can be represented exactly with 24 - λ bits. Since 24 - λ ≤ 32, several encodings into a 32-bit unsigned integer are possible. Because of (45) and the need to compare ℓ^2 with u^2, a natural choice is to encode ℓ^2 into the integer Q ∈ [0, 2^32) such that

  Q = ℓ^2 · 2^28.    (46)
An implementation of the computation of Q using the integer T defined in (42) and (43) and the parity of e' can be found at line 14 of Fig. 4 (Listing 2); a sketch of it, together with those of U and P, is given below.
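Under the Q2.30 encoding V = v · 2^30 assumed in these sketches, U, P, and Q can be computed as follows:

    uint32_t U = (V >> 6) << 6;     /* clear v_25..v_30: u * 2^30, cf. (44) */
    uint32_t P = mul(U, U);         /* floor(u^2 * 2^28),          cf. (45) */
    uint32_t Q = (0x10000000u + (T >> 4)) << (even ^ 1u);  /* l^2*2^28 (46) */

The shift by (even ^ 1) doubles (1 + τ) · 2^28 exactly when e' is odd, and T >> 4 is exact since, by (43), the last 9 + λ bits of T are zero.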
4.5.1 Rounding to Nearest and Rounding Up
Once the values of P and Q are available, the condition u ≥ ℓ used when ◦ ∈ {RN, RU} can be evaluated thanks to the following characterization.

Property 4.1. The inequality u ≥ ℓ holds if and only if the C condition P >= Q is true.
Proof. Since u and ℓ are nonnegative, u ≥ ℓ is equivalent to u^2 ≥ ℓ^2. If u^2 ≥ ℓ^2, then, by (46) and the left inequality in (45), we deduce that P + 1 > Q, which is equivalent to P ≥ Q for P and Q integers. Conversely, if P ≥ Q, then multiplying both sides by 2^{-28} gives P · 2^{-28} ≥ ℓ^2, and using the right inequality in (45), u^2 ≥ ℓ^2. To sum up, u ≥ ℓ if and only if P ≥ Q, that is, if and only if the C condition P >= Q is true. □
Together with the algorithm of Section 3.2.1, this property accounts for the implementation of rounding to nearest at lines 14-18 in Fig. 4 (Listing 2), sketched below.
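A hypothetical reconstruction of those lines, with the conventions of the previous sketches (V = v · 2^30, Dm1 = D - 1):

    if (P >= Q)                              /* u >= l: truncate v          */
        R = (Dm1 << 23) + (V >> 7);
    else                                     /* u < l: add back v_24        */
        R = (Dm1 << 23) + (V >> 7) + ((V >> 6) & 1u);

Note that (V >> 7) lies in [2^23, 2^24), so its leading bit adds 1 to Dm1, producing the biased exponent D as explained in Section 4.2.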
Since rounding up depends on the condition u ≥ ℓ as well (see Section 3.2.3), this rounding mode can be implemented easily: simply replace lines 15-18 in Fig. 4 (Listing 2) with the following code fragment:
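A hypothetical reconstruction of this fragment:

    if (P >= Q && ((V >> 6) & 1u) == 0u)     /* u >= l and v_24 = 0         */
        R = (Dm1 << 23) + (V >> 7);
    else
        R = (Dm1 << 23) + (V >> 7) + 1u;

When ◦(ℓ) = 2, the increment makes (V >> 7) + 1 equal to 2^24 and the carry propagates into the exponent field, realizing the correction required by Property 2.3 at no extra cost.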
4.5.2 Rounding Down
According to the algorithm in Section 3.2.2, rounding down does not rely on the condition u ≥ ℓ but on the condition u > ℓ instead.
In order to implement the condition u > ℓ, let Q' ∈ {0, 1} be such that Q' = 1 if and only if the equality P · 2^{-28} = u^2 holds in (45).

Property 4.2. The inequality u > ℓ holds if and only if the C condition P >= Q + Qprime is true.

Proof. Since u and ℓ are nonnegative, u > ℓ is equivalent to u^2 > ℓ^2. We consider the two cases Q' = 1 and Q' = 0 separately. Assume first that Q' = 1. Then, using the equality in (45) together with (46), we see that u > ℓ is equivalent to P > Q, that is, since P and Q are integers, to P ≥ Q + Q'. Assume now that Q' = 0, so that P · 2^{-28} < u^2. If P ≥ Q + Q' = Q, then u^2 > P · 2^{-28} ≥ Q · 2^{-28} = ℓ^2. Conversely, if u^2 > ℓ^2, then, using the left inequality in (45), we find P + 1 > Q, and thus, P ≥ Q. Therefore, u > ℓ if and only if P ≥ Q + Q'. Now, recalling that ℓ ∈ [1, 2), we deduce from (46) that Q < 2^30. Hence, Q + Q' always fits in a 32-bit unsigned integer. Consequently, the mathematical condition P ≥ Q + Q' is equivalent to the C condition P >= Q + Qprime. □
Together with the algorithm of Section 3.2.2, the above property gives the following implementation of rounding down:
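A hypothetical reconstruction (the three extra instructions compared to rounding to nearest are the low product word, the comparison giving Qprime, and the addition Q + Qprime; cf. Section 5.2):

    uint32_t Qprime = (U * U) == 0u;   /* C's unsigned multiply wraps modulo
                                          2^32, so U*U is the low word; it is
                                          zero iff P*2^-28 = u^2 exactly    */
    if ((P >= Q + Qprime) && ((V >> 6) & 1u) == 0u)  /* u > l and v_24 = 0  */
        R = (Dm1 << 23) + (V >> 7) - 1u;   /* floating-point predecessor    */
    else
        R = (Dm1 << 23) + (V >> 7);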
5 EXPERIMENTS WITH THE ST231 CORE
Combining the codes of the previous section leads immediately to a standard C implementation of sqrt for the binary32 format, with subnormal support and correct rounding for each rounding-direction attribute RN, RU, RD.
For validation purposes, each of these three versions has been compiled with GCC under Linux (using the software given in the Appendix for emulating max, nlz, and mul). This allowed for an exhaustive comparison, within a few minutes, against the square root functions of the GNU C library (glibc) and of GNU MPFR. We also compiled these three versions with the Open64-based ST200 family VLIW compiler from STMicroelectronics, at optimization level -O3 and for the ST231 core. After a review of the main features of the ST231 in Section 5.1, the performance results we have obtained in this context are presented and analyzed in Section 5.2.
5.1 Some Features of the ST231
The ST200 family of VLIW microprocessors originates from the joint design of the Lx by HP Labs and STMicroelectronics [23]. The ST231 is the most recently designed core of this family and is widely used in STMicroelectronics SoCs for multimedia acceleration.
In this processor, which executes up to four integer instructions per cycle, all arithmetic instructions operate on a file of sixty-four 32-bit registers and on a file of eight 1-bit branch registers.
Resource constraints must be observed to form proper instruction bundles containing one to four instructions: at most one control instruction, one memory instruction, and two 32 × 32 → 32 multiplications of type mul are allowed per bundle. Other instructions can be freely used, but are limited to integer-only arithmetic, without division.
Another specificity of this architecture is that any immediate form of an instruction is, by default, encoded with small immediates (9-bit signed), but can be extended to use extended immediates (32-bit) at the cost of one instruction slot per immediate extension. For instance, up to two multiplications mul, each using a 32-bit immediate, can be encoded in one bundle. This makes the use of long immediate constants such as polynomial coefficients very efficient from a memory-system standpoint. In contrast, the absence of a sophisticated data-cache hierarchy (such as an L2 cache) implies a significant cost for accessing a data table (up to 130 cycles in the case of a data-cache miss).
To reduce the number of conditional branches, the architecture provides partial predication support in the form of conditional selection instructions. In the assembly line below, $q, $r, and $s are 32-bit integer registers, and $b is a 1-bit branch register that can be defined through comparison or logical instructions:

  slct $s = $b, $q, $r
This fragment of assembly code writes $q into $s if $b is true, and $r otherwise. An efficient if-conversion algorithm based on the ψ-SSA representation is used in the Open64 compiler to generate partially predicated code based on the slct instruction [24].
The retargeting of the Open64 compiler technology to the ST200 family is able to generate efficient, dense, branch-free code for all the codes described in Section 4, requiring only the usage of one specific intrinsic to select the nlz instruction (number of leading zeros).
5.2 Performance Results for Square Root on ST231
The performance results obtained when compiling for the ST231 with the ST200 compiler are summarized in Table 5.
Versions of our codes that do not support subnormals have been implemented too; their results are given within square brackets in Table 5.
Here are some observations:
- Due to if-conversion, the generated assembly codes are fully if-converted, straight-line programs. Thus, all latencies (numbers of cycles) and instruction counts are independent of the input values.

- Depending on ◦, the value N varies as expected: compared to RN, Section 4.5.1 suggests one more addition for RU, while Section 4.5.2 suggests three more instructions for RD (get Qprime and add it to Q).

- In contrast, the value of L turns out to be independent of ◦. In fact, with subnormal support, the value 23 is exactly the latency that can be expected from the codes of Section 4 when assuming unbounded parallelism. Indeed, the critical path consists of the following four subtasks:

  - compute the value T in three cycles,
  - compute the value V by bivariate polynomial evaluation in 13 cycles,
  - round correctly (truncation, squaring, comparison, and selection) in 1 + 3 + 1 + 1 = 6 cycles, and
  - select the final result in one cycle.

  In other words, our implementations with subnormal support are scheduled optimally by the ST200 compiler. The same occurs for our versions without subnormals; those were obtained by replacing

    T = (X << 1) << M;

  with

    T = X << 9;

  The cost of computing T then drops from three cycles to one cycle, and one may check that the critical path is defined by the same subtasks as before. Hence a new theoretical latency of 21 cycles, which again is achieved in practice. Finally, note that the overhead due to subnormal support is only of two cycles.

- In all cases, the number of instructions per cycle (IPC) is larger than 2.6. A more precise description of bundle occupancy is given in Fig. 5, assuming ◦ = RN and subnormal support. Among the 23 bundles used, four contain four instructions and 12 contain three instructions. The instructions used for handling special operands are in light gray, while those corresponding to nonspecial operands are in black; instructions in dark gray are shared or used for selection. Note that, from the point of view of latency, handling special operands comes for free up to the last cycle, which is used for selecting the final result.

Table 6 shows the influence of the evaluation scheme used for P(s, t) = 2^{-25} + s · a(t) on the performance of the complete square root code. Instead of evaluating P(s, t) as in (30) and Listing 3, we evaluate a(t) by three other methods and then multiply by s and add 2^{-25}. "Best univariate" here is the scheme giving the lowest latency for a(t) (12 cycles) that we have found assuming unbounded parallelism. Comparing with the first line of Table 5, we remark the following:
- Despite its high ILP exposure, our bivariate evaluation scheme increases the total number of instructions only by four (less than seven percent) compared to Horner's rule.

- Our scheme yields a complete code that is about twice as fast as the one using Horner's rule.

- We use four cycles less than when evaluating a(t) as fast as possible and then performing in sequence the multiplication by s and the addition of 2^{-25} ("best univariate"). This shows the advantage of evaluating the bivariate polynomial P(s, t) directly, which allows s to be distributed within a(t) = Σ_i a_i t^i.

Table 7 gives the latencies obtained in [7] (for rounding to nearest and without subnormals) by implementing various other methods on the ST231. We denote by R and nR the classical restoring and nonrestoring algorithms. N2, G2, and G1 refer to variants of the Newton-Raphson and Goldschmidt methods, based on low-degree polynomial evaluation (for the initial approximation) followed by one or two iterations. P5 and P6 are methods based exclusively on piecewise, univariate polynomial approximants (three polynomials of degree 5 or two of degree 6).
Our version in 21 cycles is about seven times faster than the restoring method and twice as fast as Goldschmidt's method G2. It is also 1.43 times faster than P6, which was the previously fastest known implementation of correctly rounded square root on the ST231.
6 CONCLUSION
After detailing some properties of the square root function in binary floating-point arithmetic, we proposed, for its computation with correct rounding, a high ILP approach based on bivariate polynomial evaluation. To demonstrate the effectiveness of this approach, we further provided and analyzed complete C implementations, one for each rounding-direction attribute. Some experimental results finally showed that these codes yield optimal schedules and low latencies on a VLIW integer processor like the ST231. Also, the latency overhead for subnormal support is only two cycles.
Our approach can be extended in several ways. First, one may check that a faithfully rounded square root follows from replacing lines 12-18 of Fig. 4 (Listing 2) with

  return (Dm1 << 23) + (V >> 7);

The latency is then 19 cycles with subnormals and 17 cycles without subnormals, leading to a speedup by a factor of more than 1.2 in both cases.
Second, our approach is not restricted to square rooting and can be adapted to other operators: it has already been employed^4 on the ST231 for accelerating division [25], reciprocal square root [26], and, more generally, x^{1/n} for |n| a "small" positive integer [8]. In fact, our approach is restricted to neither algebraic functions nor the bivariate case: it could in principle be used as soon as argument reduction yields an identity of the form ℓ = F(s, t, ...), for some multivariate function F. But it remains to determine which functions other than the ones above could benefit significantly from our approach.
Third, although we have focused in this paper on implementations for the binary32 format (single precision) and 32-bit architectures, similar codes could be derived for the binary64 format (double precision) and 64-bit architectures. The only major modification would be a new polynomial approximant together with a new bivariate evaluation scheme. However, for application reasons, we are currently mostly interested in supporting double precision on the ST231. How to best use the underlying 32-bit integer arithmetic available on this chip in order to emulate efficiently a binary64 square root seems to be nontrivial. It thus remains to investigate whether the approach we have introduced here can be adapted to that context.

4. The first draft of this paper is [9] and dates back to October 2008.
