Abstract
Introduction
The increasing need for high-performance embedded applications is forcing system developers to spend more time in code optimization. In addition, there is also need for low-cost and power consumption constraints of embedded applications [1] .
Although fixed-point architectures have better cost and, especially, lower power consumption rates, they do not have the dynamic range and hardware efficiency of a floating-point architecture, which restrict them to emulate floating-point arithmetic via software [2, 3] .
These two extremes create an area in-between, where for some applications the choice of the architecture is not well-defined. These applications have the need for floating-point arithmetic, but not enough to justify a floating-point architecture [4] .
In this scenario, fixed-point architectures are becoming a feasible solution. With increasing clock speeds, these applications are able to achieve the performance demanded for floating-point arithmetic without having to migrate to a floating-point architecture [4] .
In order to extend the range of applications, studies are being conducted to improve performance on fixedpoint architectures [5, 6] , giving an opportunity to system developers to benefit from the use of low cost and low power consumption of a fixed-point architecture without losing the performance of their floating-point applications.
Well-known applications that take advantage of fixed-point architectures are filters, audio processing and video compression [7] . Another recent example is the Satellite University Program ITASAT [8] , which chose to use an ADSP-BF533 fixed-point processor in its board computer emulator instead of a floating-point. This paper presents a method for implementing high-performance math functions through polynomial approximation on the Blackfin ADSP-BF533. First, we present how to manipulate IEEE-754 floating-point format on a fixed-point architecture, followed by an explanation on how to approximate math functions like sine, exponential and logarithmic using polynomials.
We end this paper proposing an alternative to overcome the Blackfin C library performance, based on a fast floating-point emulation and fast polynomial evaluation schemes, and presenting the results obtained.
Blackfin ADSP-BF5xx processor
The Blackfin ADSP-BF5xx is a 16/32-bit fixedpoint digital signal processor (DSP) with 16 registers of 16-bit that can be manipulated as 8 registers of 32-bit. It has a dual 16-bit multiply and accumulate unit (MAC) and an easy-to-use RISC-like instruction set [9] .
The processor architecture is based on a technology jointly developed between Analog Devices and Intel, called Micro-Signal Architecture (MSA) [10] , which integrates signal-processing performance with micro-controller programming model, allowing a flexible resource allocation between hard real-time signal processing tasks and non real-time control tasks. Figure 1 shows the core of the Blackfin ADSP-BF533 processor. It is composed of an address arithmetic unit, a control unit and a data arithmetic unit.
Blackfin processors are designed to meet computational demands and power constrains of embedded audio, video and communication applications. It can achieve up to 756MHz of clock speed with a 10-stage pipeline in a modified Harvard architecture with a power consumption that ranges from 24 mW at 100 MHz and 0.8 volts, to 644 mW at 756 MHz and 1.4 volts [9, 11] .
Floating-point emulation

IEEE-754 floating-point format
The Institute of Electrical and Electronics Engineers developed a floating-point standard known as "IEEE 754-1985 Standard for Binary Floating-Point Arithmetic" [12] defining the format, precision, values and operations over floating-point values.
It defines two types of precision for floating-points: single-precision and double-precision. Our discussion will focus only on single-precision floats.
This format is similar to a scientific notation, based on a sign, mantissa and exponent (1) .
Where 'b' is the base used, 'd' represents the digits of the number, ranging from 0 to 'b' -1, 'exp' is the exponent and 'sign' is a sign-magnitude notation.
In single-precision, 32 bits are used to store floating-point values. Using a binary base, there is one bit for sign (sign-magnitude), 23 bits for the mantissa (or fraction) and 8 bits for a signed exponent (2's complement) with a bias of 127 units [13] .
This format allows very large and very small numbers to be represented, provided that the radix point is not fixed, giving a wide dynamic range and higher precision to the format.
In fixed-point arithmetic, the radix point is always at the same location, limiting the magnitude and precision, but simplifying numeric operation and saving memory.
To improve precision and simplify magnitude comparison, floating-point values are normalized, meaning that no redundant sign bits are found in the mantissa. Thus, there is no need to store the radix bit "1.", called "hidden bit", which improves precision and since all values are aligned at radix point, comparisons are done through exponent verification.
The IEEE-754 also describes the behavior of undefined operations, like division by zero, and special values, like infinite and not a number (NaN) [14] .
Emulating floating-point operations in fixed-point architectures
Because there is no specific hardware to implement floating-point operations, fixed-point architectures need to emulate these operations via software.
To achieve that efficiently, it is necessary to separate the IEEE-754 floating-point format into two words, one for the exponent and one for the signed mantissa [15] . This is possible, because DSP's are able to represent fractionary numbers, offering formats like 1.31 or 1.15 [9] , that can represent values between the range [-1,1) in a 2's complement notation, making them suitable to emulate floating-point values.
In order to ease the manipulation of these values in the Blackfin 16/32-bit native architecture, a 16-bit register is used to store the exponent and one 32-bit register to store the signed mantissa. The result of this is an 8-bit extended precision, improving the accuracy of intermediate floating-point operations.
It is important, during conversion, to insert the "hidden bit" from the IEEE-754 into the emulated format and to unbias the exponent before setting the two-word emulated floating-point value, to ensure the correct floating-point emulation.
Once the IEEE-754 floating-point sign, mantissa and exponent are extracted and stored in two separate registers, it is possible to implement the floating-point basic operations (add, subtract, multiply and divide) using only fixed-point logic.
Blackfin instruction set has two very useful instructions for this purpose, SIGNBITS and ASHIFT, where the first one computes the number of zeros before the sign bit (i.e. the exponent), and the second one operates an arithmetic shift to normalize the mantissa [16] .
The inverse steps are used to convert the emulated floating-point value back to the IEEE-754 floatingpoint format.
Polynomial approximation
Polynomial approximation consists of determining a set of coefficients for a polynomial of n th order to reconstruct a continuous function f(x) over its domain.
Since processors hardly ever implement complex functions like sine, for instance, via hardware, it is necessary to approximate these functions using only simple operations, like addition and multiplication [17] .
Therefore, a natural choice for computing these functions is through polynomial approximation, which is represented as a finite number of add/multiply operations (2) .
P(x) is a single variable polynomial of 'm' th order, with 'm'+1 'c' coefficients.
Several well-known methods are used to generate these coefficients, such as Lagrange, Hermite, Newton, etc. One of the simplest methods is the MacLaurin Series, a special case of the Taylor Series, which is a series expansion around zero.
Every finite polynomial approximation introduces error to the response. This error may increase or decrease depending on the polynomial order (the higher the order, the lower the error).
The relative error can be measured according to equation (3) .
MacLaurin Series are very imprecise for values near to the lower and upper limits of the approximation, demanding very high order polynomials to approximate functions precisely as the input value moves away from zero.
To keep the approximation error as low as possible and to reduce the polynomial order, a method called minimax is used, where the maximum relative error of the approximation is minimized over the domain range of the function. This method is based on Chebyshev's theory [18, 19] and is accomplished by the Remez Exchange Algorithm (also used in digital filters design [20] ).
Through this method, it is possible to generate efficient polynomial approximations with minimum order and reduced relative error.
To increase even more the precision of the approximation, it is common to reduce the natural domain of a function into a smaller range. This technique is known as range reduction [21, 22] and it is used not only to reduce the input value, but also to limit the minimax polynomial approximation domain. Thus, a large function domain, like the sine function, may be reduced to values between [0, π/4].
Therefore, two steps are needed to approximate a continuous function in a computer environment: 1) reduce the range of the input argument and 2) evaluate the polynomial.
In this way, a complex function can be evaluated efficiently based only in a set of simple operations, like multiplication and addition, allowing computer architectures to implement such functions.
Sine function
The sine function has an input domain that ranges from (-∞, +∞).
In order to approximate the sine function efficiently, due to its infinite input domain, it is necessary to reduce the range of the input value, like [0, π/4] for instance (this can be accomplished through trigonometric equalities). To reduce the range of the input argument, equation (4) is used [23] .
Where 'n' is the nearest integer less than or equal to x/C, 'x' is the input value, 'C' is the range reduction factor and it is equal to π/4, in this case. The final reduced argument or residue is given by 'r'.
A 7 th order minimax polynomial is generated through the use of Maple software over the reduced input range, leading to a relative error of 3.8 x 10 -9 .
Logarithmic function
The natural logarithm has an input domain that ranges from (0, +∞).
Since the input value is represented in sign/mantissa/exponent notation, the following range reduction can be performed (5). exp ln( ) ln( * 2 ) ln( ) exp*ln (2) x m m = = + (5) Where 'm' is the mantissa and 'exp' is the exponent.
Take into account that the mantissa is a positive value, implying 0.5 ≤ m < 1. Since the approximation polynomial is based on the MacLaurin Series (6), once again the mantissa is adjusted to a final range between [-0.5, 0]. th order minimax polynomial was generated over the range [-0.5, 0] with a relative error of 2.3 x 10 -8 . Unfortunately, the logarithm polynomial is a sequential power series, i.e. all powers from 1 to 9 are present, unlike the sine function polynomial that has only odd powers (using fewer multiply/add operations to evaluate the polynomial).
To improve the performance of computation by reducing the polynomial order by half, we chose to use 5 distinct 4 th order minimax polynomials, generated for each of the following input intervals [-0. This technique is called piecewise polynomial and consists in approximating a continuous function through several distinct low order polynomials, one for each interval piece [24] .
This gives a relative error of 5. Considerations about the implementation of this function will be presented in Section 6.2.
Exponential function
The exponential function has an input domain between (-∞, +∞).
To reduce the input argument, the same method applied to the sine function is used. Assuming C = ln(2)/2 k , where 'k' is an integer and determines the factor of reduction, the domain of this function will be reduced to a value between [0, ln(2)/2 k ] [25] . The exponential function may be expressed according to (7) .
e e e + = = (7) Where 'x' is the input value, 'r' is the residue, 'n' is an integer value (calculated according to equation (4)) and 'k' is the reduction factor.
To generate a low order polynomial, we chose k = 6. Considerations on this choice and implementation aspects will be given in Section 6.2.
A 2 nd order minimax polynomial was generated with an input interval between [0, 0.011], with relative error of 2.2 x 10 -8 .
Error analysis
The total error of computation or the maximum error is the accumulation of 3 types of error [26] : 1) function approximation error, 2) intermediate computation error and 3) final result rounding error.
The function approximation error of each function was presented in the previous section. Since emulated floating-point arithmetic in the Blackfin architecture provides an 8-bit extended precision, the intermediate computation error will be less then 2 -25 . The final result rounding error is caused by rounding the 32-bit emulated mantissa to a 24-bit IEEE-754 mantissa and will happen only one time (see Section 6.1) during function calculation. The rounding mode used is round-to-even and it is bounded by an error of 2 -24 .
The maximum error of each function is illustrated by Table 1 . For the logarithm function, the greatest error of the 5 sub-intervals is used. The precision of each function can be expressed in terms of number of bits according to the following relation (8).
ln( ) ln(2) 24 bits for mantissa, the precision of the functions is enough to guarantee an error of one unit in the last place (ULP), i.e., 2 -23 , for all implemented functions, except for the logarithmic function in two distinct intervals of its input value.
Optimized math functions
The Blackfin's math library is implemented in C language, based on the algorithms presented by [27] . Due to constant change between IEEE-754 floatingpoint format and emulated floating-point format a very inefficient code is generated by the compiler.
This constant change between formats causes an overhead in computation that prohibits the library to perform at higher speeds.
Assuming the following C code (9) , where the data type of variables A, B and C is single-precision floating-point:
A = A + B; C = B + 2.0; (9) D = A + C;
The Blackfin VisualDSP++4.0 compiler [28] will generate in the first line 2 conversions from IEEE-754 to an emulated floating-point and one conversion back to IEEE-754. The same will happen in the next two lines, which are very inefficient and time consuming, since variable A, B and C change formats too many times.
Due to a limitation of the compiler, instead of maintaining these values in the emulated form until all computation is done, it keeps constantly changing between formats [6] . This problem becomes even worse when evaluating a polynomial of 7 th order, containing a series of MAC (multiply and accumulate) instructions.
Since we are dealing with a fixed-point architecture, there is also the problem that every basic floating-point operation will be translated to a function call, which also is time consuming.
Furthermore, there is the lack of more aggressive polynomial approximation and evaluation schemes in the Blackfin math library.
While the Blackfin math library is based only in rational polynomial approximation p(x)/q(x) (where p(x) and q(x) are polynomials), which degrades performance due to the use of two polynomials and a floating-point division operation [22] , other optimized strategies to approximate and evaluate polynomials are ignored, like using single polynomials and look-up tables. The result of this approach is that Blackfin's C library operates at a lower rate.
Solution proposal
To overcome these limitations, we propose the use of mixed C and assembly code, which will prevent the constant format conversions, maintaining the math function input value under the emulated format until all computation is done, thus reducing the number of function calls and floating-point conversions.
In this way, the input value is converted two times only, one at the entry of the function, where the value is converted from IEEE-754 to an emulated floatingpoint format, and one at the exit of the function, where the emulated value is converted back to the IEEE-754 format.
To approximate and evaluate polynomials with higher speeds, we propose the use of a single polynomial for every function presented instead of using rational approximation, consequently avoiding the floating-point division operation which is 6.3 times slower than a floating-point MAC operation in the Blackfin architecture, measured empirically through simulation.
Other improvements suggested are the use of piecewise polynomials for the logarithmic function and the use of look-up table for both, logarithmic and exponential function (Section 6.2 will comment on that).
The use of hybrid methods, such as look-up table mixed with polynomial approximation, can reduce significantly the computational effort, thus improving the system performance over the standard Blackfin's math library.
The dataflow of our solution is illustrated according to Figure 2 .
Figure 2. Dataflow of the solution proposed.
As mentioned before, first the input value will be converted to an emulated floating-point format for manipulation (Float to Fix Conversion). The emulated floating-point will be then reduced before applying to the desired polynomial for function approximation (Range Reduction). With a reduced argument, the polynomial is evaluated (Polynomial Evaluation). The result of the polynomial approximation is converted back to the IEEE-754 floating-point format (Fix to Float Conversion) (Figure 2) .
Arranging all the improvements mentioned above, the following contributions will be made: first we have the reduction of format conversions and the reduction of function calls impacts significantly in the overall performance. Second, we can optimize function approximation and evaluation generating lower order single polynomials through more aggressive implementations and avoiding an expensive floatingpoint division. And third, we optimize pipeline occupation, avoiding data dependencies and getting better use of available resources in this architecture, such as the use of all register file (since C compiled code uses only half of the registers) [28] .
About the implementation
To implement and evaluate polynomials efficiently, we use Horner's scheme [26] , where an n th order polynomial is converted to a series of MAC instructions.
Unfortunately, since there is no hardware support for add/multiply floating-points, each MAC "instruction" turns into a raw call to an emulated floating-point multiplication function and to one addition function.
Another consideration regarding the implementation is the use of semi-table based solution for the logarithm and exponential function. For more information on table-driven algorithms see [25, 29, 30] .
For the logarithm function, the second term of the equation (5) is evaluated by look-up table. With an 8-bit exponent, 256 entries are utilized. Since the emulated format occupies 6 bytes, 2 bytes to store the exponent and 4 bytes to store the mantissa, this would require a total of 1.5KB of memory space.
In fact, only 254 entries are needed, since that the exponent values -127 and 128 represent special values (-∞ and +∞).
The choice of utilizing 5 sub-intervals is based on experimental results. The factors that influenced this choice are the maximum error of each interval and the increase of code size when dealing with various subintervals.
Every time a sub-interval is added, an if-else condition is also added to the code and needs to be executed in order to verify the interval in which the input value is in. Increasing too much the number of sub-intervals can degrade performance, once if-else conditions cause pipeline execution to halt.
The exponential function also uses a semi-table based solution, but with fewer entries than the logarithmic function. For k = 6, it needs 64 entries to handle the modulus of n/2 k in the power of 2 term (see equation (7)), resulting in 384 bytes of memory. Values above 6 increase table size without significantly reducing the polynomial order.
When the input value is negative, the value of 'n' is decremented after the evaluation of expression (4), so that the modulus operation is compensated. Therefore, there is no need for storing values of negative exponents, which limits table size to 2 k . To calculate the integer division result of the exponent, a shift of 6 times to the left is performed, turning this divide operation extremely fast. The remainder of the division, as mentioned, is treated separately through look-up table.
The choice of the reducing factor 'k', like for the logarithmic function, is also based on experimental results.
Results
The results were obtained through simulation using the VisualDSP++4.0 development environment. Results were measured in clock cycles, counting from function call to function return. No cache memories were used. The same input values were applied to both implementations, optimized and Blackfin's math library.
Minimum values were obtained when no range reduction technique was applied to the function input argument. The maximum time was obtained when range reduction techniques were used to reduce the function's input argument. Table 3 illustrates the results. Among the optimized functions, the exponential function presented the lowest number of clock cycles, with 331, while the sine function has the greatest count, with 668.
The exponential function also presented the highest time (cycle) reduction of this work with 85% over its math library counterpart or equivalently, a performance improvement of 6.6 times when no range reduction technique was applied.
The reason behind the overall improvement is the reduction in the number of conversions between formats and the use of aggressive range reduction and fast polynomial evaluation schemes, which reduce the order of the polynomial employed to approximate functions and also reduce the number of computations to produce the final result.
The worst performance improvement belongs to the sine function. Both solutions, optimized and Blackfin's math library use the same technique to reduce the range of the argument and similar polynomial approximation (this is the only function in the Blackfin math library that uses a single odd-order polynomial instead of a rational approximation).
While the solution here proposed reduces the input argument between the interval [0, π/4] and uses a 7 th order polynomial, the Blackfin math library reduces to a range between [0, π/2] with an 11 th order polynomial. Therefore, the improvement made comes mostly from the reduction of format conversions, since there is only a difference of 2 MAC operations between these solutions.
On the other hand, the exponential function performance improvement is achieved through fast polynomial approximation and evaluation.
While the Blackfin exponential function uses a range from (-ln(2)/2, ln(2)/2) with a rational approximation of 3 rd order, the optimized exponential function only employs a single 2 nd order polynomial, avoiding the division operation, with a range low as [0, 0.01].
Another contribution of this work is the minor time variability of the math functions implemented. Notice the time difference between the maximum and minimum values. While Blackfin's math library has significant time differences, depending on the input value, this work presents smaller differences.
Finally, we could compare our results with the results presented by [31] . Although their work was on a floating-point processor, it is possible to notice that through optimization we can reduce the gap between fixed-point architectures and floating-point architectures, giving system designers the option of choosing a lower cost processor without putting at risk system performance, as mentioned in Section 1 of this paper.
Taking a closer look to the exponential cycle count of the TI's math library presented by [31] and the one obtained here, we have 213 cycles for TI's TMS320C67x floating-point processor and 331 cycles (best time) for the optimized ADSP-BF533 fixed-point processor. Now, considering TI's processor running at 200 MHz and Blackfin's processor at 756 MHz, this would result in a total execution time of 1.06 µs and 0.44 µs for each processor.
Continuing our discussion, if we compare power consumption, based on the benchmarks performed by [11] , Blackfin has a power consumption of 644 mW at 756 MHz and 1.4 V, while the TMS320C6713 consumes 694 mW at 200 MHz and 1.2 V.
Conclusion
This paper presented the implementation of an optimized math library for the fixed-point Blackfin ADSP-BF533 processor.
Performance improvements were obtained due to the use of a fast emulated floating-point format with reduced number of conversions between floating-point formats and due to the use of optimized polynomial approximation schemes implemented with mixed C/assembly code.
The results reflected on a time reduction that varies from 66% and up to 85% compared to the Blackfin math library for the sine, logarithm and exponential functions.
