This paper presents a C library for the software support of single precision floating-point (FP) arithmetic on processors without FP hardware units such as VLIW or DSP processor cores for embedded applications. This library provides several levels of compliance to the IEEE 754 FP standard. The complete specifications of the standard can be used or just some relaxed characteristics such as restricted rounding modes or computations without denormal numbers. This library is evaluated on the ST200 VLIW processors from STMicroelectronics.
INTRODUCTION
Most embedded systems for digital signal processing, image processing and digital control rely on integer or fixed-point processors.
1 Indeed, for those applications, most fast and low-power processor cores have no FP unit for cost reasons (smaller area). When implementing algorithms dealing with real numbers on such processors, one has to introduce some scaling operations, in the target program, in order to keep accurate computations. 2 The insertion of scaling operations is complicated due to the wide range of real numbers required in many applications and it depends on the algorithm and data. Furthermore, scaling is time consuming at both the application and design levels. Algorithms based on FP arithmetic do not have this problem. Then, porting programs from general purpose processors (with FP unit) to integer or fixed-point cores is a complex task. For instance, prototyped applications using Matlab have to be converted to fixed-point format. In those cases, an efficient solution may be to use a fast FP emulation layer over the integer or fixed-point units. This is the goal of our work.
The C library, presented in this paper, provides the basic five operations: addition, subtraction, multiplication, division and square-root for a quasi-fully compliant single-precision (SP) IEEE 754 FP format (the flags are not supported). But this library also provides some running modes with relaxed characteristics: no denormal numbers or restricted rounding modes for instance. A secondary goal of this work is to study the impact of the more or less full compliance to IEEE floating-point standard on software implementations. This library has been developed and validated within a collaboration with STMicroelectronics. The library has been targeted to the VLIW (very long instruction word) processor cores of the ST200 family. This paper is organized as follows. A summary on floating-point arithmetic is presented in Section 2. In Section 3, the target processor and compiler are presented. The new library is presented in Section 4. The validation of this new library is discussed in Section 5. The results of the implementation of some typical applications are presented in Section 6 as well as a comparison to the original library.
FLOATING-POINT ARITHMETIC
The FP standard defines the data format, some special values, the possible rounding modes, conversion rules, the behavior and the accuracy of the basic five operations (±, ×, /, √ ). In this section, we recall the main features of the single precision IEEE FP 754 standard 3 and the corresponding basic operations (see 4 for instance).
Format of Single Precision Floating-Point Numbers
A real number x is represented, in the floating-point format (the float data type in the C programming language for single precision numbers), using 3 fields: the sign s x , the exponent e x and the mantissa m x such that:
The sign s x is a 1-bit field: s x = 0 for positive values and s x = 1 for negative values. The exponent e x is represented using an 8-bit integer with −126 ≤ e x ≤ 127. The stored exponent is represented using a biased representation with bias b = 127, i.e. the stored exponent is an unsigned integer e x,b such that e x,b = e x + b. The minimum and maximum biased exponents, 0 and 255, are reserved for zero, the special values and the denormal numbers. The mantissa, also known as significand, is composed of an implicit leading bit and the fraction f x such that m x = 1.f x for normal numbers and m x = 0.f x for denormal numbers. In single precision, the fraction is a 23-bit field. The format is illustrated on Figure 1 . 
Representable Numbers and Special Values
Normal numbers are represented by x = (−1) sx × 1.f x × 2 ex with −126 ≤ e x ≤ 127. As the implicit leading bit is always 1, it is not stored and it is also called the hidden bit. For normal numbers, we have m x ∈ [1, 2).
The denormal, or subnormal, numbers are such that e x,b = 0 but the fraction is non-zero (otherwise it would be the zero value). In this case, the implicit leading bit is 0 and the exponent e x is set to e min = −126. Denormal numbers are represented by x = (−1)
Due to the implicit leading one in the mantissa, the zero value is not representable. Zero is thus the special value such that with e x,b = f x = 0. The unspecified sign bit allows two values for zero: −0 and +0. This is consistent with the following properties : 
Rounding
If x and y are representable numbers (machine numbers), then the result of the operation r = x y may not be representable in the target format (e.g. 1/3 in the decimal system). The result must be rounded. The standard provides four rounding modes:
• round to +∞, denoted by (r): returns the smallest machine number larger than or equal to the exact result r.
• round to −∞, denoted by ∇(r): returns the largest machine number smaller than or equal to the exact result r. • round to 0, denoted by Z(r): returns (r) for negative numbers and ∇(r) for positive numbers.
• round to nearest, denoted by •(r): returns the machine number that is the closest to the exact result r.
If r is the exact middle between two consecutive machine numbers, it returns the FP number whose least significant bit is a 0 (round to nearest even). This is the default rounding mode in practice.
The standard requires an important property: correct rounding. Let x and y be two machine numbers, an operation of {+, −, ×, ÷} and the active rounding mode among the 4 IEEE rounding modes. The result of (x y) must be (x th y): the returned result must be the result of the operation computed using an infinite accuracy (i.e. the theoretical operation th ) and then rounded. The standard requires the same property for the square root operation.
Conversion, Comparison and Behavior in IEEE Floating-Point Arithmetic
The standard defines conversion operations from integers towards floating-point numbers (FPN), from FPNs towards integers, from FPNs towards integers represented using a FPN, between FPNs of different formats and finally, between binary and decimal representations for input/output. The comparisons defined in the standard are: =, >, < and non-ordered. The standard specifies that "no floating-point operation should alter the behavior of the overall computation". Consequently, some flags are defined in order to inform the system about the behavior of some operations. Five flags are defined in the standard: invalid operation (the result is NaN), overflow (the result is ±∞ or the largest representable number depending on the rounding mode), underflow (the result is ±0 or a denormal number), division by zero (the result is ±∞) and inexact result (raised when an operation involves a non-exact result).
STMICROELECTRONICS VLIW PROCESSOR AND COMPILER
The ST200 family of high performance low power VLIW processor cores are targeted at STMicroelectronics system on chips (SOC) solutions for use in computationally intensive applications as a host or an audio or video processor. Such applications include embedded systems in consumer, digital TV and telecommunication markets.
ST200 Family of VLIW Processor Cores
The ST200 is a four issue VLIW processor and is significantly simpler and smaller than an equivalent four issue super-scalar processor. The compiler is key to generate, schedule and bundle operations, removing the need to add complex hardware to achieve the same results.
The ST200 design originates from the LX research project started as a joint development by Hewlett-Packard laboratories and STMicroelectronics 5 done by Josh Fisher and his team for printer imaging applications. STMicroelectronics purchased the rights to develop the LX architecture 6 and to create the ST200 processor family. to 32-bit by 32-bit integer and fractional multiplication on the ST231 core. There is no divide instruction but a division step instruction suitable for 32-bit division. In order to reduce the use of conditional branches, the ST200 processor also provides conditional selection instructions.
The ST200 features a clean, compiler-friendly 6-stage instruction pipeline depicted in Figure 3 , where all operations are fully pipelined. Almost all operation latencies are 1 cycle, meaning that the result of an operation can be used in the next bundle. Some instructions have a 3-cycle latency (load, multiply, compare to branch) or in one case a 4-cycle latency (load link register LR to call). Starting from the ST230 member of the ST200 family, latencies are enforced with interlocks, that enable some important code size reduction by removing the nops or their equivalent that may be otherwise needed to guarantee a correct schedule. 
The ST200 Compiler
STMicroelectronics developed the st200cc compiler based on the Open64 technology 7 and the LAO code optimizer, 8 in order to replace the Multiflow Trace Scheduling compiler 9 originally used on the LX processor. The st200cc compiler was released in early 2003 and is currently actively developed to support new ST200 cores and advanced optimizations.
The STMicroelectronics ST200 production compiler st200cc is based on the GNU gcc and g++ compiler front-end components, the Open64 compiler technology, 7 a global instruction scheduler adapted from the ST100 LAO code generator, 8 an instruction cache optimizer 10 including support for dead code elimination (DCE), dead data elimination (DDE) and the GNU assembler, linker and binary utilities. The Open64 compiler has been retargeted from the Intel IA64 to the STMicroelectronics ST200, except for the instruction predication, the global code motion (GCM), the instruction scheduler and the software pipeliner (SWP). Indeed these components appeared to be too dependent on the IA64 architecture predicated execution model and on its support of modulo scheduling through rotating register files.
In the st200cc compiler, the Open64 GCM and SWP components are replaced by the ST200 LAO instruction scheduler / software pipeliner, which is activated at optimization level -O3. At optimization level -O2 and below, only the Open64 basic block instruction scheduler is active.
In addition, access to integer and DSP arithmetic from C and C++ is supported by an extensive set of intrinsic functions. The functions defined in the ST200 Application Programming Interface provide access to low-level features, such as ST200 leading zero count instruction and to medium level features, such as the standard ETSI
11
and ITU 12 basic operators. All intrinsic functions are especially optimized by C and C++ compilers and are also delivered as standard C models for application validation on a workstation.
The intrinsic functions support has been fully developed in assembly language, validated using their C semantic models and a Data Generation Language 
Initial Floating-Point Library Derived from SoftFloat
The floating-point support is a derived work from John Hauser's SoftFloat package, 14 using the 64-bit variant of the IEEE754 floating-point implementation, but keeping only one rounding mode (round to nearest even), removing denormal support and using specific intrinsic functions such as count leading zero to gain speed. 
IMPLEMENTED LIBRARY

Phase B: Handling special values.
If an operand is a special value such as ±∞, ±0 or NaN, the result of the operation is known in advance and thus requires no numerical computation. Consider for example the case of multiplication, without denormals. (The same remark applies to other operations, with or without denormals.)
The table below displays all the possible values of the product (x × y) when x or y is a special value.
x, y
NaN +∞ −∞ +0 −0 other NaN NaN NaN NaN NaN NaN NaN +∞ NaN +∞ −∞ NaN NaN ±∞ −∞ NaN −∞ +∞ NaN NaN ±∞ +0 NaN NaN NaN
A first implementation of the above table can be a sequence of four "if-then-else" tests on the values of e x,b and e y,b (pseudo-code on the left). However, one can check that {e x,b , e y,b } ∩ {0, 255} = ∅ with only one "if-then-else" test after a "max" computation (pseudo-code on the right). This follows from the fact that testing the equivalent condition {e 
i . We see that condition {e x,b , e y,b }∩{0, 255} = ∅ is equivalent to condition max(E x , E y ) = 254. Hence the pseudo-code on the right. The goal here is to filter all the special values cases in only one test (and costly jump) since those values are extremely rare. On the target processor the min, max operations are executed in only one cycle. In the case of denormal input numbers, the preceeding steps can be easily adapted. Those cases were filtered in phase B and dedicated code performs the computation in this case. Even if the input operands are normal numbers, a denormal result can occur during the last part of the computation, but its treatment is quite simple.
In the next four Sections (4.1 -4.4), we will quickly describe the algorithms for each operation in this phase. 
Floating-Point Addition and Subtraction
For operands x, y encoded by X, Y as in 2.1, r = (x − y) is generated using an addition after complementing the first bit of Y. Hence we now focus on efficient implementation of FP addition r = (x + y) only. First, e r,b is guessed to be E = max(e x,b , e y,b ) and the exponent difference e x,b − e y,b is computed as a 32-bit signed integer D. Then, from the last 26 bits of X << 3 and Y << 3, mantissas MX, MY are built up as 32-bit signed integers in two's complement representation; if D < 0, these mantissas are further interchanged in order to reduce to the case where e x ≥ e y .
Alignment of the two mantissas is done by right shifting MY by m = min(26, |D|) positions; to prepare for rounding, the last bit of this new aligned mantissa is further set to 1 if at least one of the last m bits of MY is nonzero (one part of the sticky bit computation). Mantissas can now be added together: the first bit of the sum gives s r and its absolute value M gives a first (possibly unormalized) guess for m r . Mantissa normalization follows by left or right shifting M, depending on the number of the leading zeroes of M and by updating the last bit of the shifted M (as it was done for MY). Then e r,b is obtained by possibly incrementing E by one according to the active rounding mode , the values of the last three bits of M and the sign s r . The rounding of M is done in a similar fashion. The efficiency of such an implementation of FP addition is shown in Table 2 ; the main reasons for it are:
1. the approach of phase B (Section 4) for detecting special values; 2. the use of the MIN, MAX functions rather than more costly classic "if-then-else" tests; 3. the use of a dedicated processor "leading zero count" instruction available in the ST200 family. Table 2 reports the timing in number of bundles for the addition operation. This is the value of the average case timing (i.e. no special values and no over/under-flow). Two versions of the new library are compared to the original library. The first version is without denormal numbers (wo/D) and with rounding mode set to rounding to nearest even. The second version is with denormal numbers (w/D). 
Floating-Point Multiplication
The product of the two mantissas MX × MY is performed using the standard 4 partial product decomposition. Decompositions using only 3 partial product have been investigated without success. Indeed, those decompositions reduce the number of intermediate products but not the total latency and furthermore they require additional extraction instructions before the product. Table 3 . Main characteristics of the multiplication operation.
Floating-Point Division
We start by taking s r = XOR(s x , s y ) and e r,b = e x,b + e y,b − 127 and by left shifting the mantissa of y by one position. In the normalized case, 2 −1 < m x /m y < 2 and we shall compute the first binary digits of m x /(2m y ) = 0.q 1 q 2 q 3 · · · . More precisely, if m x < m y then q 1 q 2 = 01 and one needs the next 24 bits q 3 · · · q 25 q 26 , q 26 being the guard bit used for rounding; if m x ≥ m y then q 1 = 1 and then only q 2 · · · q 24 q 25 are needed, with q 25 as a guard bit. In both cases, each digit q i is computed iteratively by a classical non-restoring division step in base 2. 4 The number of iterations is thus either 25 or 26 and the last iteration is followed by the usual correction step in case of a negative remainder. Note that in the case of 26 iterations, normalization of the result mantissa m r implies that the result biased exponent e r,b is decreased by one. The remainder is then used to update the value of the sticky bit, and rounding of the mantissa can be done using both the guard and sticky bits.
In the case of denormals, we add the necessary pre-treatment and post-treatment operations. Again, special values are handled efficiently as described in phase B (Section 4). 
Floating-Point Square Root
For r = √ x with r, x as in Section 2.1, one takes s r = s x and e r,b = (e x,b + 127)/2 . If e x,b is even (that is, e x is odd), we replace m x with 2m x . Hence, for a normalized x, this new m x satisfies 1 ≤ m x < 4 and thus in any case 1 ≤ √ m x < 2. We then compute the first 24 fractional bits of √ m x = 1.q 1 q 2 · · · q 23 q 24 · · · (with q 24 as a guard bit) by a standard non-restoring iterative scheme. 4 As for division, these iterations are followed by a correction step based on whether the remainder is zero or not. This remainder further gives the value of the sticky bit which is updated before rounding the computed mantissa.
Notice that since square root is a unary operator, handling special values is simpler than for other binary operators; it can be implemented efficiently as follows. Let x be an operand encoded as the 32-bit unsigned integer X of Section 2.1. If X > 2 31 then a NaN is returned, because in this case x is nonzero, with negative sign. Otherwise, one detects whether x ∈ {±0, +∞, NaN} as before by testing if e x,b ∈ {0, 255}; in this case, x is returned. Table 5 reports the timing, code size and bundle fill rate for the square-root operation.
VALIDATION
Our library has been validated using two levels of test. The first level validates the individual operations. It is based of long vectors of chosen patterns for each operation. A test pattern is composed of the chosen argument(s) and the expected result of the operation. The operation passes the vector test if for each pattern the computed result is the expected value. The patterns are chosen to test all the branches in the algorithm as well as the In Figure 5 , a 11−14% speed improvement is achieved for the new library without denormal numbers (wo/D). For the new version with denormal numbers (w/D), the improvement is about 5 − 9%.
We also measure the code size of all the algorithms and for the different versions of the library. The impact of the new libraries on the complete code size is very small. In the case of the new library without denormal numbers, an average increase of the code size of about 0.3% is measured (with respect to the original library). The increase is about 0.8% (up to 1.4% in some cases) for the new version with denormal numbers. This shows that a significant increase of the code size for the individual operations has only a very small impact on the complete code size. Indeed, those functions are just inserted in only one place in the complete code. This should not lead to worse instruction cache performances.
CONCLUSION & FUTURE PROSPECTS
In this paper, we have presented a new C library for single precision IEEE floating-point for processors without FP hardware units such as VLIW or DSP processor cores for embedded applications. This library provides the basic arithmetic operations (addition, subtraction, multiplication, division and square root) and quasi-fully complies with the IEEE 754 standard (the flags are not supported). In particular, this library offers the four rounding modes and supports denormal numbers. The library has been optimized and tested on STMicroelectronics processors of the ST200 family.
Also, the library has been validated at two different levels. Special values, specific numerical values and random values are used in the first level of validation. The cases involving special values such as ±0, ±∞ or NaN have been tested exhaustively. Second, the library has been used successfully as the underlying arithmetic layer when running various numerical algorithms with known results.
The performances of the new library have been compared to those of the original library. The conclusions of the simulations done with this architecture are:
