Abstract: We present a fixed-point architecture (source VHDL code is provided) for powering computation (x^y). The fully customized architecture, based on the expanded hyperbolic CORDIC algorithm, allows for design space exploration to establish trade-offs among design parameters (numerical format, number of iterations), execution time, resource usage, and accuracy. We also generate Pareto-optimal realizations in the resource-accuracy space: this approach can produce optimal hardware realizations that simultaneously satisfy resource and accuracy requirements.
I. INTRODUCTION
The powering function x^y appears frequently in scientific and engineering applications. Accurate software routines are readily available, but they are often too slow for real-time applications. Dedicated hardware implementations using fixed-point arithmetic are attractive, as they can exhibit high performance and low resource usage.
Direct computation of x^y in hardware usually requires table lookup methods and polynomial approximations, which are not efficient in terms of resource usage. The authors in [1] propose an efficient composite algorithm for floating-point arithmetic. The work in [2] describes the implementation of x^y = e^(y ln x) in floating-point arithmetic using a table-based reduction with polynomial approximation for e^x and a range reduction method for ln x. Alternatively, we can efficiently implement e^x and ln x using the well-known hyperbolic CORDIC algorithm [3]: a fixed-point architecture with an expanded range of convergence is presented in [4], and a scale-free fixed-point hardware is described in [5]. There are other algorithms that outperform CORDIC under certain conditions: the BKM algorithm [6] generalizes CORDIC and features some advantages in the residue number system, and a modification of CORDIC is proposed in [7] for faster computation. All the listed methods impose a constraint on the domain of the functions.
This work presents a fixed-point hardware for x^y computation. We use the hyperbolic CORDIC algorithm with expanded range of convergence [8] to first implement e^x and ln x, and then x^y = e^(y ln x). Compared to the floating-point implementation presented in [9], a fixed-point hardware reduces resource usage at the expense of reduced dynamic range and accuracy. The main contributions of this work include:
• Open-source, generic and customized architecture validated on an FPGA: The architecture, developed at the register transfer level in fully parameterized VHDL code, can be implemented on any existing hardware technology (e.g., FPGA, ASIC, Programmable SoC).
• Design space exploration: We explore trade-offs among resources, performance, accuracy, and hardware design parameters. In particular, we study the effect of the fixed-point format on accuracy.
• Pareto-optimal realizations based on accuracy and resource usage: By generating the set of optimal (in the multi-objective sense) architectures, we can optimally manage resources by modifying accuracy requirements.
The paper is organized as follows: Section II details the CORDIC-based computation of x^y. Section III describes the architecture for e^x, ln x, and x^y. Section IV details the experimental setup. Section V presents results in terms of execution time, accuracy, resources, and Pareto-optimal realizations. Conclusions are provided in Section VI.
II. CORDIC-BASED CALCULATION OF x^y
Here, we explain how the expanded hyperbolic CORDIC algorithm is used to compute ln x and e^x, from which we get x^y = e^(y ln x). We then analyze the input domain bounds of x^y resulting from the expanded hyperbolic CORDIC algorithm.
A. Expanded hyperbolic CORDIC to compute e^x and ln x
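The original hyperbolic CORDIC algorithm has a limited range of convergence. To address this issue, the expanded hyperbolic CORDIC algorithm [8] introduces additional iterations with negative indices:
i ≤ 0: x_{i+1} = x_i + δ_i y_i (1 − 2^{i−2}), y_{i+1} = y_i + δ_i x_i (1 − 2^{i−2}), z_{i+1} = z_i − δ_i tanh⁻¹(1 − 2^{i−2})
i > 0: x_{i+1} = x_i + δ_i y_i 2^{−i}, y_{i+1} = y_i + δ_i x_i 2^{−i}, z_{i+1} = z_i − δ_i tanh⁻¹(2^{−i})
The value of δ_i depends on the operation mode:
Rotation: δ_i = −1 if z_i < 0, +1 otherwise. Vectoring: δ_i = −1 if x_i y_i ≥ 0, +1 otherwise. (1)
There are M + 1 negative iterations (i = −M, …, −1, 0) and N positive iterations (i = 1, 2, …, N). The iterations 4, 13, 40, …, k, 3k + 1 must be repeated to guarantee convergence. For sufficiently large N, the values of x_n, y_n, z_n converge to:
Rotation: x_n = A_n (x_in cosh z_in + y_in sinh z_in), y_n = A_n (y_in cosh z_in + x_in sinh z_in), z_n = 0
Vectoring: x_n = A_n sqrt(x_in² − y_in²), y_n = 0, z_n = z_in + tanh⁻¹(y_in / x_in)
where x_in, y_in, z_in are the initial values and A_n is the CORDIC gain accumulated over all iterations.
Table I shows, as M increases, how the expanded CORDIC dramatically expands the argument bounds of e^x and ln x. The expanded CORDIC is thus crucial for proper x^y computation.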
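The iteration schedule described above can be sketched in a few lines of Python. This is only an illustrative model, not the authors' VHDL: the names `iteration_schedule`, `shift_factor`, and `cordic_gain` are hypothetical, M and N are assumed to denote the negative-index bound and the number of positive iterations, and the per-iteration factors (1 − 2^(i−2) for i ≤ 0, 2^(−i) for i > 0) are the ones commonly given for the expanded hyperbolic CORDIC.

```python
import math

def iteration_schedule(M, N):
    """Return indices -M..0 followed by 1..N, repeating 4, 13, 40, ... (k -> 3k + 1)."""
    schedule = list(range(-M, 1))          # M + 1 negative-index iterations
    next_repeat = 4
    for i in range(1, N + 1):
        schedule.append(i)
        if i == next_repeat:
            schedule.append(i)             # repeated iteration
            next_repeat = 3 * next_repeat + 1
    return schedule

def shift_factor(i):
    """Multiplier applied at iteration i: 1 - 2^(i-2) for i <= 0, 2^(-i) for i > 0."""
    return 1.0 - 2.0 ** (i - 2) if i <= 0 else 2.0 ** (-i)

def cordic_gain(schedule):
    """Product of sqrt(1 - t_i^2) over every iteration actually executed."""
    gain = 1.0
    for i in schedule:
        gain *= math.sqrt(1.0 - shift_factor(i) ** 2)
    return gain
```

Because the gain depends only on the schedule, it can be precomputed once per (M, N) pair and supplied to the hardware as a constant.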
B. Computation of x^y
To compute x^y = e^(y ln x), we first use CORDIC in the vectoring mode with x_in = x + 1, y_in = x − 1, z_in = 0 to get z_n = (ln x)/2. We then apply z_n × 2 = ln x. Finally, we use CORDIC in the rotation mode with x_in = y_in = 1/A_n (where A_n is the CORDIC gain), z_in = y ln x to get x_n = e^(y ln x) = x^y.
Fig. 1 depicts the input domain bound (area bounded by the curve) as a function of M for x^y = e^(y ln x), which is given by |y ln x| ≤ θ_max(M), where θ_max(M) is the rotation-mode convergence bound. These are the (x, y) values for which x^y converges. Note the asymptotes when x → 1 (as ln x → 0) and y → 0. The input domain does not include x ≤ 0, as ln x is undefined for x ≤ 0. Thus, the algorithm can only compute x^y for x > 0. For x < 0 and an integer y, we can compute |x|^y and (−1)^y separately; for non-integer y, we cannot compute x^y as the result is a complex number.
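As a rough illustration of the domain bound, the sketch below checks whether a pair (x, y) falls inside the convergence region. It assumes that the convergence bound equals the sum of all elementary angles (including the repeated iterations), that M and N denote the negative-index bound and the number of positive iterations, and that the vectoring pass for ln x is subject to the same bound on (ln x)/2. The function names are hypothetical.

```python
import math

def theta_max(M, N):
    """Assumed convergence bound: sum of all elementary angles, repeats included."""
    total = sum(math.atanh(1.0 - 2.0 ** (i - 2)) for i in range(-M, 1))
    next_repeat = 4
    for i in range(1, N + 1):
        angle = math.atanh(2.0 ** (-i))
        total += angle
        if i == next_repeat:               # iterations 4, 13, 40, ... run twice
            total += angle
            next_repeat = 3 * next_repeat + 1
    return total

def in_domain(x, y, M, N):
    """True if x^y = e^(y ln x) is expected to converge for these parameters."""
    if x <= 0:
        return False                       # ln x is undefined for x <= 0
    limit = theta_max(M, N)
    ln_x = math.log(x)
    vectoring_ok = abs(ln_x) / 2.0 <= limit    # first pass computes (ln x)/2
    rotation_ok = abs(y * ln_x) <= limit       # second pass rotates by y ln x
    return vectoring_ok and rotation_ok
```

For example, under these assumptions `in_domain(2.0, 3.0, 5, 16)` holds while `in_domain(2.0, 30.0, 5, 16)` does not, since 30 ln 2 exceeds the bound.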
III. FIXED-POINT ITERATIVE ARCHITECTURE FOR x^y
Here, we describe the fixed-point hardware that computes x^y = e^(y ln x). This architecture is based on an expanded hyperbolic CORDIC engine that can compute e^x and ln x.
A. Hyperbolic CORDIC engine
Fig. 2: hyperbolic CORDIC engine (inputs x_in, y_in, z_in).
... the bottom one requires three adders. A state machine controls the iteration counter, the add/sub input of the adders, the loading of the registers, and the multiplexer selectors.
We use the same fixed-point format throughout the datapath; the number of integer bits is the total number of bits minus the number of fractional bits. The customized hyperbolic CORDIC architecture allows the user to modify the design parameters: number of bits, number of fractional bits, number of positive iterations, and number of negative iterations. The use of fixed-point arithmetic optimizes resource usage. However, as it features a small numeric range, we might not be able to use the entire input domain of the algorithm (see Table I).
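To make the range limitation concrete, here is a minimal sketch of a fixed-point format with a given total and fractional bit count. It assumes signed two's complement with saturation and round-to-nearest, which the text does not state; `to_fixed`, `to_float`, and `format_range` are hypothetical helper names.

```python
def to_fixed(value, total_bits, frac_bits):
    """Quantize a real value to a signed fixed-point integer (assumed two's complement)."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = int(round(value * scale))
    return max(lowest, min(highest, raw))  # saturate instead of wrapping around

def to_float(raw, frac_bits):
    """Convert a fixed-point integer back to a real value."""
    return raw / (1 << frac_bits)

def format_range(total_bits, frac_bits):
    """Return (minimum value, maximum value, resolution) of the format."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return lowest / scale, highest / scale, 1.0 / scale
```

With 16 total bits and 8 fractional bits, for instance, the representable range is about ±128 with a resolution of 2^(−8), so an input of 1000 would saturate.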
B. Architecture for Powering Computation: x^y
Fig. 3 depicts the block diagram of the circuit that implements x^y = e^(y ln x). Note that the same bit-width is used throughout the architecture. This circuit utilizes one hyperbolic CORDIC engine in two steps:
First, we load x_in = x + 1, y_in = x − 1, z_in = 0 onto the CORDIC engine in the vectoring mode, so that z_n = (ln x)/2. To get x + 1 and x − 1, we use adders with a constant input. A shifter generates ln x. A fixed-point multiplier then computes y ln x, which is fed back into the CORDIC engine. In the second step, we load x_in = y_in = 1/A_n (where A_n is the CORDIC gain), z_in = y ln x onto the CORDIC engine in the rotation mode, so that we get x_n = e^(y ln x) = x^y.
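The two-step dataflow can be modeled in software as follows. This is a floating-point sketch under stated assumptions, not the fixed-point VHDL datapath: the iteration equations are the ones commonly given for the expanded hyperbolic CORDIC, M and N are assumed names for the negative-index bound and the number of positive iterations, and all function names are hypothetical.

```python
import math

def schedule(M, N):
    """Indices -M..0, then 1..N with iterations 4, 13, 40, ... repeated."""
    order, next_repeat = list(range(-M, 1)), 4
    for i in range(1, N + 1):
        order.append(i)
        if i == next_repeat:
            order.append(i)
            next_repeat = 3 * next_repeat + 1
    return order

def factor(i):
    """1 - 2^(i-2) for negative-index iterations, 2^(-i) for positive ones."""
    return 1.0 - 2.0 ** (i - 2) if i <= 0 else 2.0 ** (-i)

def cordic(x, y, z, order, vectoring):
    """Run one pass of the hyperbolic CORDIC engine over the given schedule."""
    for i in order:
        t = factor(i)
        if vectoring:
            d = -1.0 if x * y >= 0 else 1.0   # drive y towards zero
        else:
            d = -1.0 if z < 0 else 1.0        # drive z towards zero
        x, y, z = x + d * y * t, y + d * x * t, z - d * math.atanh(t)
    return x, y, z

def power(x, y, M=5, N=32):
    """Compute x^y = e^(y ln x) for x > 0 with two CORDIC passes."""
    order = schedule(M, N)
    # Step 1 (vectoring): z converges to atanh((x - 1)/(x + 1)) = (ln x)/2.
    _, _, half_ln = cordic(x + 1.0, x - 1.0, 0.0, order, vectoring=True)
    angle = y * (2.0 * half_ln)               # shifter, then multiplier
    # Step 2 (rotation): start at 1/gain so that x converges to e^angle.
    gain = math.prod(math.sqrt(1.0 - factor(i) ** 2) for i in order)
    result, _, _ = cordic(1.0 / gain, 1.0 / gain, angle, order, vectoring=False)
    return result
```

Calling `power(2.0, 3.0)` should return a value close to 8; the residual error shrinks as N grows, which mirrors the accuracy-versus-iterations trade-off explored later in the paper.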
The design parameters of the architecture are: bit-width, fractional bit-width, number of positive iterations, and number of negative iterations. This allows for fine control of accuracy, execution time, and resources.
IV. EXPERIMENTAL SETUP

A. Selection of parameters for design space exploration
The parameterized VHDL code allows for the generation of a space of hardware profiles by varying the design parameters. We consider the bit-widths (24, 28, 32, 36, 40, 48, 52, 56, 60, 64, 68, 72, 76) and the numbers of positive iterations (8, 12, 16, 20, 24, 28, 32, 36, 40). A discussion on the fixed-point format is presented in Section IV.C. For simplicity's sake, we fix M = 5 (6 negative iterations). Each of the functions e^x, ln x, and x^y requires a different architecture. For each function, we generate 13 × 9 = 117 different hardware profiles. Results are obtained for every hardware profile and for every function.
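The design space described above can be enumerated with a short script. This is a sketch only: the dictionary keys and constant names are hypothetical, and the fractional bit-width of each profile (listed in the paper's tables) is not modeled here.

```python
from itertools import product

# Parameter values taken from the text; the names are illustrative.
BIT_WIDTHS = (24, 28, 32, 36, 40, 48, 52, 56, 60, 64, 68, 72, 76)
POSITIVE_ITERATIONS = (8, 12, 16, 20, 24, 28, 32, 36, 40)
NEGATIVE_ITERATIONS = 6   # fixed: indices -5..0

profiles = [
    {
        "bits": bits,
        "positive_iterations": iterations,
        "negative_iterations": NEGATIVE_ITERATIONS,
    }
    for bits, iterations in product(BIT_WIDTHS, POSITIVE_ITERATIONS)
]
```

Each entry would correspond to one synthesis run per function, giving 117 runs for each of the three architectures.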
B. Generation of input signals
For e^x and ln x, we selected 1000 equally spaced points in the allowable input domain listed in Table I.
For ln x, 37 integer bits are required to cover the input domain. Thus, we included cases with bit-widths above 68 bits in Table II. The scaling factor, provided as a constant input to the architecture, depends on the numbers of negative and positive iterations as per (6).
The VHDL code was synthesized on a Xilinx® Zynq-7000 XC7Z010 SoC (ARM processor + FPGA fabric) that runs at 125 MHz.
B. Resource usage
Fig. 5 shows resource usage only in terms of 6-input LUTs and 1-bit registers for the fourteen bit-widths of Table II and for the e^x, ln x, and x^y architectures. As the number of bits grows, so do the resources. The LUT increase is more pronounced, indicating a large combinational cost. Here, we fixed M = 5.
The effect of the number of positive iterations on resource usage is negligible: it only affects the size of the lookup table for the angles and the state machine.
C. Accuracy
For accuracy, we use the peak signal-to-noise ratio: PSNR (dB) = 10 × log10(MAX² / MSE), where MSE is the mean squared error between the results of our architecture and the reference results provided by the MATLAB® built-in function in double-precision floating point.
MAX is defined as the largest value of the fixed-point output format. However, for consistency, we use the shortest fixed-point format that can represent the largest output value for each function (this might differ from the one in Table II).
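The accuracy metric can be sketched as below. The function name `psnr_db` and the argument `max_value` (standing for the largest value of the chosen fixed-point output format) are illustrative; the reference values would come from a double-precision software routine.

```python
import math

def psnr_db(results, reference, max_value):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(results) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(results, reference)) / len(reference)
    if mse == 0.0:
        return math.inf                    # identical outputs: unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the hardware results track the reference more closely; because the metric is normalized by the peak value, functions with very different output ranges can still be compared.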
To validate our selection of fixed-point formats, Fig. 6 shows accuracy results for N = 40 for e^x and ln x in the input domain of Section IV.B. Note the very poor accuracy when fewer than 20 and 37 integer bits are used for e^x and ln x, respectively. We can also see the effect of the number of fractional bits on accuracy.
Figs. 7, 8, and 9 plot accuracy as a function of the number of positive iterations and the bit-width for e^x, ln x, and x^y, respectively. In each case, note how the PSNR values stabilize after a certain number of iterations. For e^x, the 24-bit case yields poor results regardless of the number of iterations. For ln x, bit-widths below 72 yield poor results. For x^y, we tested with the (x, y) domain specified in Section IV.B (this is not the full allowable domain); here, bit-widths below 40 yield poor results.
For e^x with a 24-bit width, 16 integer bits are insufficient to properly represent many intermediate and output values, hence the poor accuracy results. This is illustrated in Fig. 10.
For ln x (and thus x^y), note that bit-widths below 72 provide fewer than 37 integer bits. This does not properly represent the entire input domain of ln x (Table I).
As for x^y, we detailed some issues when using the full allowable convergence domain for e^x and ln x; this provides a hint on the behavior of x^y.
D. Multi-objective optimization of the design space for x^y
Since the execution time depends solely on the numbers of negative and positive iterations, we consider it more important to illustrate the trade-offs between accuracy and resources. We present the accuracy-resources plot for all design parameters for x^y in Fig. 13. This allows for a rapid trade-off evaluation of resources (given in Zynq-7000 slices) and accuracy for every generated hardware profile.
Moreover, Fig. 13 also depicts the Pareto-optimal [10] set of architectures that we extracted from the design space. This allows us to discard, for example, hardware profiles (bit-widths above 52) that require more resources for no increment in accuracy. The figure also indicates the design parameters that generate each Pareto point. There are hardware realizations featuring poor accuracy (less than 40 dB) in the Pareto front. For design purposes, these points should not be considered.
... and 20 positive iterations meets these constraints, and v) Accuracy > 40 dB for no more than 1000 Slices: three hardware profiles satisfy these constraints. We select the one that further optimizes a particular need: accuracy or resources.
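Extracting the Pareto front and applying constraints of this kind can be sketched as follows. The profile tuples (name, slices, PSNR in dB) and the helper names are hypothetical; the sketch assumes that resources are minimized and accuracy is maximized, and it is not the authors' extraction procedure.

```python
def pareto_front(profiles):
    """Keep profiles for which no other profile has fewer-or-equal slices and higher PSNR.

    profiles: iterable of (name, slices, psnr_db) tuples.
    """
    front = []
    best_psnr = float("-inf")
    # Sort by resources ascending; break ties by accuracy descending.
    for profile in sorted(profiles, key=lambda p: (p[1], -p[2])):
        if profile[2] > best_psnr:         # strictly better accuracy than any cheaper profile
            front.append(profile)
            best_psnr = profile[2]
    return front

def select(front, min_psnr_db, max_slices):
    """Filter Pareto points by an accuracy floor and a resource ceiling."""
    return [p for p in front if p[2] >= min_psnr_db and p[1] <= max_slices]
```

With made-up points such as `[("a", 500, 35.0), ("b", 700, 60.0), ("c", 900, 55.0)]`, point "c" is dominated by "b" and dropped, and an accuracy floor of 40 dB additionally removes "a".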
VI. CONCLUSIONS
A fully parameterized fixed-point iterative architecture for x^y computation was presented and thoroughly validated. The expanded CORDIC approach allows for customized improved bounds on the domain of x^y. The Pareto-optimal architectures extracted from the multi-objective design space allow us to solve optimization problems subject to resource and accuracy constraints. We also provided a comprehensive assessment of how the fixed-point architecture affects the functions.
Further efforts will focus on the implementation of a family of architectures for x^y, ranging from the iterative version presented here to a fully pipelined version. We will also explore the use of the scale-free hyperbolic CORDIC [5], which requires fewer iterations for the same interval of convergence.
