Abstract
Introduction

Powering (
) is an important operation in applications such as scientific computing, digital signal processing (DSP) and computer 3D graphics [7] . As other elementary functions, such as square root, inverse square root, logarithm and trigonometric functions, it has been traditionally computed by software routines [2, 6] . These routines provide very accurate results, but they are often too slow for numerically intensive or real-time applications. The timing constraints of these applications have led to the development of dedicated hardware for the computation of elementary functions, including the implementation of table-based algorithms [9, 15] , functional iteration methods [10, 11] and digit-recurrence algorithms [1, 4] .
However, accurately computing the floating-point powering function is considered difficult [9] , and the prohibitive hardware requirements of a table-based implementation (note that is a 2-variable function) have led only to partial solutions, such as powering algorithms for a constant exponent [12, 16] or for very low precision [7] . A direct implementation of a digit-recurrence algorithm for powering computation is not feasible due to its high intrinsecal complexity.
In this paper we present a composite iterative algorithm for the computation of the powering function ( ), for a floating-point input operand Å Ü ¾ Ü and integer -bit operand ÐÒ´Å Ü µ by using a high-radix recurrence with selection by rounding [13] . An intermediate computation Ä ÐÒ´Å Ü µ, with Ä ÐÓ ¾´ µ, is carried out using a high-radix left-to-right carry-free (LRCF) multiplication [3] . Another LRCF multiplication by ÐÒ ¾ is performed to guarantee the convergence of the algorithm and, as last step, the exponential of the resulting product is computed by an on-line highradix (Ö ¾ ) algorithm, with on-line delay AE ¾ [14] .
In the stages computing the logarithm and the exponential, selection by table look-up is performed in the first iteration to guarantee the convergence of both algorithms. However, the on-line delay of 2 cycles allows the addressing of the initial tables in the exponential scheme one cycle in advance, reducing the delay of the critical path in this stage.
A sequential architecture implementing our algorithm with radix Ö ½¾ is proposed in this paper, and estimates of the total area and execution time for single and doubleprecision floating-point computations are obtained according to an approximate model for the area and delay of the main components used in the proposed architecture. 
Algorithm
In this section we present a new composite iterative algorithm for the computation of the powering function ( ), for a floating point operand and an integer exponent , describing the algorithm and its error analysis.
Overview
The algorithm is based on the well-known identity
ÐÒ´ µ
Considering a floating-point 2 input operand Å Ü ¾ Ü , with Å Ü the Ò-bit significand and Ü the exponent, and an integer -bit input operand :
According to (1) , the powering function can be calculated by a sequence of operations consisting of the logarithm of the significand Å Ü , a multiplication by and the exponential of the resulting product. The form of the result is a significand ÐÒ´ÅÜ µ and an exponent Ü , typical of a floating-point operand 3 . For an efficient implementation of the powering function, the computation of the operations involved must be overlapped, which requires a left-to-right most-significant digit first (MSDF) mode of operation and the use of a redundant number system.
A problem of the proposed algorithm is the range of the digit-recurrence exponential, which is´ ÐÒ ¾ ÐÒ ¾µ, while 2 The algorithm can be adapted to a normalized fixed-point input operand ÅÜ, ½ ÅÜ ¾, resulting in 4. Computing the exponential Å Þ ÐÒ ¾ , by employing an on-line high-radix algorithm [14] .
The exponent Þ can be computed in parallel by an integer multiply/add unit. The output of the algorithm is the floating-point normalized result:
The use of redundancy results in 
Error analysis
The final error in the algorithm consists of the accumulation of the errors due to the cascaded implementation of a set of operations, and must be bounded by ¾ Ò .
Let¯Ð be the error in the computation of the logarithm of the input significand (ÐÒ´Å Ü µ · Ð ), and¯Ñ ¼ the error due to the finite arithmetic in the computation of the product Ä ( ÐÓ ¾´ µ · Ñ¼ ). When the first LRCF multiplication is performed, the associated error is: The difference between the exact result and the obtained result must be bounded:
For a precision of Ò bits and a radix Ö ¾ , a set of minimum values for¯Ð,¯Ñ ¼ ,¯Ñ ½ ,¯Ñ ¾ and¯ must be determined to guarantee a final result accurate to Ò bits. The values of the error parameters set the precision to be reached in each stage, Ò Ð , Ò , Ò Ñ½ , Ò Ñ¾ , and the accuracy Ò Ñ¼ of the product ÐÓ ¾´ µ. These parameters determine a minimum number of iterations of the logarithm (AE Ð ) and the exponential (AE ) to be performed. As shown in Figure 1 , the number of iterations of the LRCF multiplications to be performed is the same as those of the logarithm, AE Ð , because all the information must reach the exponential stage. The parameters AE Ð , AE , Ò Ð , Ò , Ò Ñ½ , Ò Ñ¾ and Ò Ñ¼ set the size of the look-up tables, adders and multipliers to be used. Moreover, Ð , , Ñ½ and Ñ¾ guard bits must be employed to guarantee that in each stage of the powering algorithm the iteration errors do not affect the achievement of the required precisions Ò Ð , Ò , Ò Ñ½ and Ò Ñ¾ .
The critical parameter to be first minimized is¯ , since it is directly related to the required precision Ò to be reached in the exponential stage, and therefore to AE , the number of iterations of the on-line exponential to be performed 4 . For radix Ö ½¾ , the set of minimum values for the considered parameters is shown in Table 1 for singleprecision (Ò ¾ ) and double-precision (Ò ¿ ) computations. AE and AE Ð are given in cycles, and Ò , Ò Ð , Ò Ñ¼ , Ò Ñ½ and Ò Ñ¾ are given in number of bits.
Implementation
In this section a sequential architecture is proposed for the implementation of our high-radix powering algorithm. We give a detailed explanation of the main computations involved: (i) high-radix logarithm, (ii) high-radix LRCF multiplication, and (iii) on-line high-radix exponential, and then outline the main features of the logic blocks employed. Figure 2 shows the block diagram of the proposed architecture. Single thick lines denote long-word (around Ò bits) operands/variables in parallel form, single thin lines denote short-word (up to 11 bits) operands/variables in parallel form, and double lines denote single-digit ( bits) variables (Ê , I, and À ). 4 Note that AE is the only parameter that affects the latency of the algorithm, since AE ¾ for any Ò and any Ö , as shown in [14] .
High-Radix Logarithm
A high-radix digit-recurrence logarithm is described in detail in [13] , althought in the algorithm used here some modifications and optimizations have been made according to the operation flow in the powering computation. A slightly different notation from the used in [13] is employed here. The block diagram of the high-radix logarithm stage is shown in Figure 3 .
The recurrences for performing the multiplicative normalization of the input operand Å Ü and computing the logarithm digits are
For a result precision of Ò Ð bits, a total number of AE Ð Ò Ð iterations are necessary. The scaled recurrence Ê has been defined as
in order to extract a radix-Ö digit Ê per iteration from the same bit-positions in all iterations. The selection of the digits Ð in iterations ¾ is done by rounding an estimate of the residual obtained by truncating Ï Ð to Ø fractional bits. The selection function is
The sign of the digit Ð is defined as opposite of the sign of Ï Ð in order to satisfy a bound on the residual, and thus most significant bits of the input operand Å Ü , and the selection is done in such a way that the value of Ð ¾ is bounded according to the convergence conditions [13] . However, this results in an over-redundant digit Ð ½ ( · ½ bits), increasing by one bit the size of the multiplier operand. The convergence conditions also determine a minimum value of Ø ¾ and the radix Ö . A multiply/add unit is used for the computation of the residual recurrence, unlike in the algorithm proposed in [13] , where a separated SD multiplier and SDA4 were used. The digit Ð ½ is stored in the look-up table Ì ´ÖÐ ½ µ already in SD-4 recoded form, to reduce the delay of the path containing the table and the multiply/add unit.
The logarithm constants are stored in a look-up table whose size grows exponentially with the radix. However, an approximation Ð Ö ½ can be used in iterations AE Ð ¾ · ½ in order to reduce the overall hardware requirements of the algorithm. The use of redundant representation is mandatory in our algorithm for powering computation, due to the left-to-right operation flow, and results in faster execution times by making the additions independent of the precision.
LRCF Multiplication
The left-to-right carry-free (LRCF) multiplication, introduced in [3] , produces the product digits from a redundant set in a most-significant-digit-first (MSDF) manner and performs the conversion on-the-fly of the product generated in a redundant form to the conventional form without using a carry-propagate adder and without additional delay. The resulting implementation is fast and regular and is very well suited for VLSI implementations.
We adapt the LRCF multiplication to carry out the intermediate multiplications in the high-radix powering, utilizing its left-to-right operation flow, with redundant representation of the operands and recoding to a high-radix of the multiplier operand. However, since the high-radix exponential is of the on-line type, the digits produced by LRCF multiplier are used without conversion. The block diagram of the two LRCF multiplication stages is shown in Figure 4 .
The adapted LRCF algorithm has two operands and . is the radix-2 representation of the multiplicand , such that The digits of the multiplier are used from most to least significant, unlike conventional multiplication schemes which use a right-to-left mode. After Ò Ñ steps, Ô Ò Ñ is the most significant part of the product while Û Ò Ñ is the least significant part. Regarding the implementation proposed in [3] , the main modification in the LRCF multiplication stages in the architecture for powering computation is the use of a multiply/add unit for computing the recurrences, instead of separated redundant multiplier and adder.
Proceedings of the 16th IEEE
On-Line High-Radix Exponential
The on-line high-radix exponential is described in detail in [14] , althought a slightly different notation is used here. The block diagram of the on-line high-radix exponential stage is shown in Figure 5 .
The exponential is computed on the basis of partial information:
with AE the on-line delay and À the radix-Ö digits of the input operand À.
The recurrences for performing the additive normaliza- 5 The range of Ã is Ã ¿Ö ¾. The use of selection by table look-up is required in the first iteration. The table is addressed by the most significant bits of the input operand À ½ , and the selection is performed so that ´Ö ¿µ ¾ ´Ö ¾µ, which results in an overredundant first digit ½ . The convergence conditions also determine a minimum value of Ø ¾ , an on-line delay AE ¾ , and a bound on the radix Ö .
The look-up tables used in the first iteration are addressed one cycle in advance, since À ¼ is already known while À AE·½ is being computed, and the obtained values can be stored in the registers Ö ½ and Ö ÐÒ ½ .
The conversion from redundant to conventional representation of the final result is performed using an on-the-fly method [4] . Table 2 . Execution time and total area for the proposed architecture (Ö ½ ¾ )
Implementation details
The main features of the proposed architecture are:
All variables are in redundant representation (signeddigit) to allow faster execution of iterations, since the additions become independent of the precision.
All products of the type É Ö and Õ Ö (with Õ Ð or ) are performed as shifts, since Ö ¾ .
SDA« is a signed-digit binary adder with « input bitvectors. The addition of two SD operands requires a SDA4 adder, while SDA3 adders can be used for accumulating a SD operand and an operand in two's complement (¾ ) representation.
An internal recoding from SD radix-2 to SD radix-4 representation of the multiplier operand is performed in the multiply/add units to reduce by half the number of partial products to be accumulated.
The round&rec units take the´ · Øµ most significant bits of the residuals Ï Ð and Ï (Ø ¾ in both cases) and produce a SD-4 representation of the digits Ð and , to be used as multipliers in the multiply/add units.
The round&assim units take the same´ · Øµ input words, but produce a two's complement representation of the coefficient with opposite sign ( Ð or ), by rounding the input word and performing its assimilation to non-redundant representation. This ¾ representation is used for addressing the look-up tables storing the logarithm constants and as a input for the barrel shifters producing Ð Ö ½ or Ö.
Evaluation and comparison
In this section we present estimates of the execution time and the area costs of the proposed architecture, for single and double-precision floating-point computations (Ò ¾ and Ò ¿ bits) and radix Ö ½¾ . These estimates are based on an approximate model for the cost and delay of the main logic blocks used [5, 11, 13] .
The actual delays and area cost depend on the technology used and on the actual implementation. However, this technology-independent model provides a good first-order approximation to the actual execution time and area values. Table 2 shows the execution time and area cost estimates of the proposed architecture for single and double-precision FP computations, with Ö ½ ¾ . The units employed are the delay of a complex gate, such a full-adder, and the area of a 1-bit full-adder ( ).
The latency, as explained in (4), is AE · cycles, since AE ¾ in this case, and therefore we have 10 and 15 cycles respectively for single and double-precision 6 . The cycle time of 10.0 corresponds in both cases to the high-radix logarithm stage, where the critical path is the one composed of the initial table look-up for selecting the digit Ð ½ , a multiplexer, the multiply/add unit and the register Ï Ð . The total execution times can therefore be estimated as 100 and 150 , as shown in Table 2 .
The total area of our architecture consists of the hardware requirements of the individual stages (high-radix logarithm, LRCF multiplications and on-line high-radix exponential, shown in Figures 3 to 5) , plus the integer ( ¢Ò ÜÔ )-bit multiply/add unit to compute the exponent Þ , an extra ( ¢ Ò Ñ¼ )-bit multiplier required to compute Ä ÐÓ ¾´ µ, the control logic, and a -bit CPA adder to assimilate À ½ into conventional representation before addressing the initial tables in the on-line high-radix exponential stage.
The total table size is 46Ã Ø× (5.75Ã ) and 161Ã Ø × (20Ã ), respectively, for single and double-precision computations, with a contribution to the total area, according to the model used, of 1673 and 5660 . The total area estimates, as shown in Table 2 , are 4138 and 10131 .
Comparison with high-radix CORDIC
For the sake of reference, we give now execution time and area estimates for a high-radix vectoring CORDIC implementation [1] . This algorithm has been chosen as a reference because it allows the computation of two-variable functions (the modulus of a 2D vector). To allow for a fair comparison, the same assumptions are made for both algorithms, a radix-½¾ implementation is considered for the CORDIC 7 , and we consider a fixed-point implementation of our powering algorithm for Ò ¿ ¾ . (taking 3 cycles due to the complexity of the second scaling), and 5 extra iterations for compensating the scaling factor. This results in a latency of 12 cycles ( · ¿ · , since the first extra iteration can be overlapped with the last microrotation). According to the approximate model used, the critical path has a delay of 10 , and therefore the execution time can be estimated as 120 . The total area of this implementation is about 6000 , with a table size of ½ ¾Ã .
We can conclude that with our algorithm, the powering function can be computed efficiently in dedicated hardware with cost and performance comparable to those of the evaluation of other elementary functions.
Summary
A new composite algorithm for the computation of the powering function has been presented in this paper. The algorithm consists of computing as ÐÒ´ÅÜ µ ¾ Ü , through a sequence of overlapped operations: logarithm, multiplication and exponential. The computation of the logarithm is carried out by a high-radix digit-recurrence unit, with selection by rounding. The intermediate multiplications are computed by high-radix left-to-right carry-free (LRCF) multipliers. Finally, the exponential is computed by an on-line high-radix unit with selection by rounding. The computation of the exponent of the result is carried out in parallel by an integer multiply/add unit.
A sequential architecture implementing our algorithm has been proposed. Estimates of the execution times and hardware requirements have been obtained, based on an approximate model for the delay and the area of the main components, for single and double-precision floating-point computations with radix Ö ½ ¾ .
A comparison with a high-radix CORDIC architecture shows that the execution times and area costs of the unit computing the powering are similar to those of other iterative algorithms for the computation of elementary functions.
