CORDIC-based Architecture for Powering Computation in Fixed-Point
  Arithmetic by Simmonds, Nia et al.
CORDIC-based Architecture for Powering 
Computation in Fixed Point Arithmetic 
Nia Simmonds†, Joshua Mack 
†Dept. of Electrical Engineering and Computer Science 
Case Western Reserve University, OH, USA 
†Electrical and Computer Engineering Department 
The University of Arizona, Tucson, AZ, USA 
nia.simmonds@case.edu, jmack2545@email.arizona.edu  
Sam Bellestri, Daniel Llamocca 
Electrical and Computer Engineering Department 
The University of Alabama, AL, USA 
Oakland University, Rochester, MI, USA 
sdbellestri@crimson.ua.edu, llamocca@oakland.edu 
 
 
Abstract—We present a fixed point architecture (source VHDL 
code is provided) for powering computation. The fully customized 
architecture, based on the expanded hyperbolic CORDIC 
algorithm, allows for design space exploration to establish trade-
offs among design parameters (numerical format, number of 
iterations), execution time, resource usage and accuracy. We also 
generate Pareto-optimal realizations in the resource-accuracy 
space: this approach can produce optimal hardware realizations 
that simultaneously satisfy resource and accuracy requirements. 
Keywords—fixed point arithmetic, coordinate rotation digital 
computer (CORDIC), logarithm, powering. 
I. INTRODUCTION 
The powering function frequently appears in many scientific 
and engineering applications. Accurate software routines are 
readily available, but they are often too slow for real-time 
applications. On the other hand, dedicated hardware 
implementations using fixed-point arithmetic are attractive as 
they can exhibit high performance, and low resource usage. 
Direct computation of 𝑥𝑦 in hardware usually requires table 
lookup methods and polynomial approximations, which are not 
efficient in terms of resource usage. The authors in [1] propose 
an efficient composite algorithm for floating point arithmetic. 
The work in [2] describes the implementation of 𝑥𝑦 = 𝑒𝑦 ln 𝑥 in 
floating point arithmetic using a table-based reduction with 
polynomial approximation for 𝑒𝑥 and a range reduction method 
for ln 𝑥. Alternatively, we can efficiently implement 𝑒𝑥 and ln 𝑥 
using the well-known hyperbolic CORDIC algorithm [3]: a 
fixed point architecture with an expanded range of convergence 
is presented in [4], and a scale-free fixed point hardware is 
described in [5]. There are other algorithms that outperform 
CORDIC under certain conditions: the BKM algorithm [6] 
generalizes CORDIC and features some advantages in the 
residue number system, and a modification of CORDIC is 
proposed in [7] for faster computation of 𝑒𝑥 . All the listed 
methods impose a constraint on the domain of the functions. 
This work presents a fixed-point hardware for 𝑥𝑦 
computation. We use the hyperbolic CORDIC algorithm with 
expanded range of convergence [8] to first implement 𝑒𝑥  and 
ln 𝑥 , and then 𝑥𝑦 = 𝑒𝑦 ln 𝑥 . Compared to a floating point 
implementation presented in [9], a fixed point hardware reduces 
resource usage at the expense of reduced dynamic range and 
accuracy. The main contributions of this work include:  
 Open-source, generic and customized architecture 
validated on an FPGA: The architecture, developed at 
the register transfer level in fully parameterized VHDL 
code, can be implemented on any existing hardware 
technology (e.g.: FPGA, ASIC, Programmable SoC). 
 Design space exploration: We explore trade-offs among 
resources, performance, accuracy, and hardware design 
parameters. In particular, we study the effect of the fixed 
point format on accuracy. 
 Pareto-optimal realizations based on accuracy and 
resource usage: By generating the set of optimal (in the 
multi-objective sense) architectures, we can optimally 
manage resources by modifying accuracy requirements.  
The paper is organized as follows: Section II details the 
CORDIC-based computation of 𝑥𝑦 . Section III describes the 
architecture for 𝑒𝑥 , ln 𝑥 , and 𝑥𝑦 . Section IV details the 
experimental setup. Section V presents results in terms of 
execution time, accuracy, resources, and Pareto-optimal 
realizations. Conclusions are provided in Section VI. 
II. CORDIC-BASED CALCULATION OF 𝑥𝑦 
Here, we explain how the expanded hyperbolic CORDIC 
algorithm is used to compute ln 𝑥 and 𝑒𝑥, from which we get  
𝑥𝑦 = 𝑒𝑦 ln 𝑥. We then analyze the input domain bounds of  𝑥𝑦 
resulting from the expanded hyperbolic CORDIC algorithm.  
A. Expanded hyperbolic CORDIC to compute 𝑙𝑛 𝑥  𝑎𝑛𝑑 𝑒𝑥 
The original hyperbolic CORDIC algorithm has a limited 
range of convergence. To address this issue, the expanded 
hyperbolic CORDIC algorithm [8] introduces additional 
iterations with negative indices. The value of 𝑖 depends on the 
operation mode: 
𝑖 ≤ 0: {
𝑥𝑖+1 = 𝑥𝑖 + 𝑖𝑦𝑖(1 − 2
𝑖−2)
𝑦𝑖+1 = 𝑦𝑖 + 𝑖𝑥𝑖(1 − 2
𝑖−2)
𝑧𝑖+1 = 𝑧𝑖 − 𝑖𝑖 , 𝑖 = 𝑇𝑎𝑛ℎ
−1(1 − 2𝑖−2)
 (1) 
𝑖 > 0: {
𝑥𝑖+1 = 𝑥𝑖 + 𝑖𝑦𝑖2
−𝑖
𝑦𝑖+1 = 𝑦𝑖 + 𝑖𝑥𝑖2
−𝑖
𝑧𝑖+1 = 𝑧𝑖 − 𝑖𝑖 , 𝑖 = 𝑇𝑎𝑛ℎ
−1(2−𝑖)
 (2) 
𝑅𝑜𝑡𝑎𝑡𝑖𝑜𝑛: 𝛿𝑖 = −1 𝑖𝑓 𝑧𝑖 < 0; +1, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
𝑉𝑒𝑐𝑡𝑜𝑟𝑖𝑛𝑔: 𝛿𝑖 = −1 𝑖𝑓 𝑥𝑖𝑦𝑖 ≥ 0; +1, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
 (3) 
This material is based upon work supported by the National Science 
Foundation under Grant No. EEC-1263133. 
TABLE I. CORDIC CONVERGENCE BOUNDS FOR THE ARGUMENT OF THE 
FUNCTIONS AS M INCREASES. THE ORIGINAL CORDIC CASE IS INCLUDED. 
M 𝑐𝑜𝑠ℎ𝑥, 𝑠𝑖𝑛ℎ𝑥, 𝑒𝑥 ln 𝑥 
Original CORDIC [−1.11820, 1.11820] (0, 9.35958] 
0 [−2.09113, 2.09113] (0, 65.51375] 
1 [−3.44515, 3.44515] (0, 982.69618] 
2 [−5.16215, 5, 16215] (0, 3.04640 × 104] 
3 [−7.23371, 7.23371] (0, 1.91920 × 106] 
4 [−9.65581, 9.65581] (0, 2.43742 × 108] 
5 [−12.42644, 12.42644] (0, 6.21539 × 1010] 
6 [−15.54462, 15,54462] (0,3.17604 × 1013] 
7 [−19.00987, 19.00987] (0, 3.24910 × 1016] 
8 [−22.82194, 22.82194] (0, 6.65097 × 1019] 
9 [−26.98070, 26,98070] (0, 2.72357 × 1023] 
10 [−31.48609, 31.48609] (0, 2.23085 × 1027] 
 
There are 𝑀 + 1 negative iterations (𝑖 = −𝑀, … , −1,0) and 
𝑁  positive iterations ( 𝑖 = 1,2, … , 𝑁 ). The iterations 4, 13,
40, … , 𝑘, 3𝑘 + 1  must be repeated to guarantee convergence. 
For sufficiently large 𝑁, the values of 𝑥𝑛 , 𝑦𝑛 , 𝑧𝑛 converge to: 
𝑅𝑜𝑡𝑎𝑡𝑖𝑜𝑛: {
𝑥𝑛 = 𝐴𝑛(𝑥𝑖𝑛𝑐𝑜𝑠ℎ𝑧𝑖𝑛 + 𝑦𝑖𝑛𝑠𝑖𝑛ℎ𝑧𝑖𝑛)
𝑦𝑛 = 𝐴𝑛(𝑥𝑖𝑛𝑐𝑜𝑠ℎ𝑧𝑖𝑛 + 𝑦𝑖𝑛𝑠𝑖𝑛ℎ𝑧𝑖𝑛)
𝑧𝑛 = 0
 (4) 
𝑉𝑒𝑐𝑡𝑜𝑟𝑖𝑛𝑔: {
𝑥𝑛 = 𝐴𝑛√𝑥𝑖𝑛
2 − 𝑦𝑖𝑛
2
𝑦𝑛 = 0
𝑧𝑛 = 𝑧𝑖𝑛 + 𝑡𝑎𝑛ℎ
−1(𝑦𝑖𝑛 𝑥𝑖𝑛⁄ )
 (5) 
𝐴𝑛 = (∏ √1 − (1 − 2𝑖−2)2
0
𝑖=−𝑀 ) ∏ √1 − 2
−2𝑖𝑁
𝑖=1  (6) 
Note that 𝑥𝑖𝑛  = 𝑥−𝑀 , 𝑦𝑖𝑛  = 𝑦−𝑀, 𝑧𝑖𝑛  = 𝑧−𝑀. To get 𝑒
𝛼 =
𝑐𝑜𝑠ℎ𝛼 + 𝑠𝑖𝑛ℎ𝛼 , we use 𝑥𝑖𝑛 = 𝑦𝑖𝑛 = 1 𝐴𝑛⁄ , 𝑧𝑖𝑛 = 𝛼  in the 
rotation mode. To get (ln 𝛼) 2⁄ = 𝑡𝑎𝑛ℎ−1(𝛼 − 1 𝛼 + 1⁄ ), we 
use 𝑥𝑖𝑛 = 𝛼 + 1, 𝑦𝑖𝑛 = 𝛼 − 1, 𝑧𝑖𝑛 = 0 in the vectoring mode.  
The parameter 𝑀 controls the range of convergence of the 
expanded hyperbolic CORDIC:  [−𝜃𝑚𝑎𝑥(𝑀), 𝜃𝑚𝑎𝑥(𝑀)]. This is 
the bound on the domain of 𝑐𝑜𝑠ℎ/𝑠𝑖𝑛ℎ/𝑒𝑥  and the range of 
𝑡𝑎𝑛ℎ−1. For ln 𝑥, the argument is bounded by (0, 𝑒𝜃𝑚𝑎𝑥(𝑀)×2]. 
Table I shows, as 𝑀  increases, how the expanded CORDIC 
dramatically expands the argument bounds of 𝑒𝑥 and ln 𝑥 The 
expanded CORDIC is thus crucial for proper 𝑥𝑦 computation. 
B. Computation of 𝑥𝑦 
To compute 𝑥𝑦 = 𝑒𝑦 ln 𝑥 , we first use CORDIC in the 
vectoring mode with 𝑥𝑖𝑛 = 𝑥 + 1, 𝑦𝑖𝑛 = 𝑥 − 1, 𝑧𝑖𝑛 = 0 to get 
𝑧𝑛 = (ln 𝑥) 2⁄ . We then apply 𝑧𝑛 × 2𝑦 = 𝑦 ln 𝑥 . Finally, we 
use CORDIC in the rotation mode with 𝑥𝑖𝑛 = 𝑦𝑖𝑛 = 1 𝐴𝑛⁄ ,
𝑧𝑖𝑛 = 𝑦 ln 𝑥 to get 𝑥𝑛 = 𝑒
𝑦 ln 𝑥 = 𝑥𝑦. 
Fig. 1 depicts the input domain bound (area bounded by the 
curve) as a function of 𝑀 for 𝑥𝑦 = 𝑒𝑦 ln 𝑥 , which is given by 
|𝑦 ln 𝑥| ≤ 𝜃𝑚𝑎𝑥(𝑀). These are the (𝑥, 𝑦) values for which 𝑥
𝑦 
converges. Note the asymptotes when 𝑥 → 1 (as ln 𝑥 → 0) and 
𝑦 → 0 . The input domain does not include 𝑥 ≤ 0 , as ln 𝑥  is 
undefined for 𝑥 ≤ 0. Thus, the algorithm can only compute 𝑥𝑦 
for 𝑥 > 0. For 𝑥 < 0 and an integer 𝑦, we can compute |𝑥|𝑦 and 
(−1)𝑦 separately; for non-integer 𝑦, we cannot compute 𝑥𝑦 as 
the result is a complex number.  
III. FIXED-POINT ITERATIVE ARCHITECTURE FOR 𝑥𝑦 
Here, we describe the fixed-point hardware that computes 
𝑥𝑦 = 𝑒𝑦 ln 𝑥 . This architecture is based on an expanded 
hyperbolic CORDIC engine that can compute 𝑒𝑥 and ln 𝑥.  
A. Hyperbolic CORDIC engine 
Fig. 2 depicts the hyperbolic CORDIC engine. The top stage 
implements the 𝑀 + 1  negative iterations, while the bottom 
stage implements the 𝑁 positive iterations. By proper selection 
of 𝑥𝑖𝑛 , 𝑦𝑖𝑛 , 𝑧𝑖𝑛  and the operation mode, this architecture can 
compute various functions (e.g.: 𝑒𝑥, 𝑐𝑜𝑠ℎ, 𝑠𝑖𝑛ℎ, 𝑎𝑡𝑎𝑛ℎ). 
Each stage utilizes two barrel shifters, a look-up table (LUT), 
and multiplexers. The top stage requires five adders, while the 
Figure 1. Domain of convergence for 𝑥𝑦 as a function of 𝑀: the domain 
grows as 𝑀 increases. The case 𝑀 = 5 is highlighted. 
Figure 2.  Parameterized expanded hyperbolic CORDIC engine. 𝐵: bit-width. 𝑁: 
number of positive iterations, 𝑀 + 1: number of negative iterations. 
xin yin zin
bottom one requires three adders. A state machine controls the 
iteration counter for 𝑖 , the add/sub input of the adders, the 
loading of the registers, and the multiplexer selectors. 
We use the fixed point format [𝐵 𝐹𝑊]  through all the 
datapath, with 𝐼𝑊 = 𝐵 − 𝐹𝑊  integer bits and 𝐹𝑊  fractional 
bits. The customized hyperbolic CORDIC architecture allows 
the user to modify the design parameters: number of bits (𝐵), 
number of fractional bits (𝐹𝑊), number of positive iterations 
(𝑁), and number of negative iterations (𝑀 + 1). The use of fixed 
point arithmetic optimizes resource usage. However, as it 
features a small numeric range, we might not be able to use the 
entire input domain of the algorithm (see Table I). 
B. Architecture for Powering Computation: 𝑥𝑦 
Fig. 3 depicts the block diagram of the circuit that 
implements 𝑥𝑦 = 𝑒𝑦 ln 𝑥 . Note that the same bit-width (𝐵) is 
used throughout the architecture. This circuit utilizes one 
hyperbolic CORDIC engine in two steps:  
First, we load 𝑥𝑖𝑛 = 𝑥 + 1, 𝑦𝑖𝑛 = 𝑥 − 1, 𝑧𝑖𝑛 = 0 onto the 
CORDIC engine in the vectoring mode, so that 𝑧𝑛 = ln 𝑥 2⁄ . To 
get 𝑥 + 1, 𝑥 − 1, we use adders with a constant input. A shifter 
generates ln 𝑥. A fixed point multiplier then computes 𝑦 ln 𝑥, 
which is fed back into the CORDIC engine. In the second step, 
we load 𝑥𝑖𝑛 = 𝑦𝑖𝑛 = 1 𝐴𝑛⁄ , 𝑧𝑖𝑛 = 𝑦 ln 𝑥  onto the CORDIC 
engine in the rotation mode, so that we get  𝑥𝑛 = 𝑒
𝑦 ln 𝑥 = 𝑥𝑦. 
The design parameters of the 𝑥𝑦 architecture are: bit-width 
(𝐵), fractional bit-width (𝐹𝑊), number of positive iterations 
(𝑁), and number of negative iterations (𝑀 + 1). This allows for 
fine control of accuracy, execution time, and resources.  
IV. EXPERIMENTAL SETUP 
A. Selection of parameters for design space exploration 
The parameterized VHDL code allows for the generation of 
a space of hardware profiles by varying the design parameters. 
We consider: 𝐵 (24, 28, 32, 36, 40, 48, 52, 56, 60, 64, 68, 72, 
76) and 𝑁 (8, 12, 16, 20, 24, 28, 32, 36, 40). A discussion on the 
format [𝐵 𝐹𝑊] is presented in Section IV.C. For simplicity’s 
sake, we fix 𝑀 = 5 (6 negative iterations). Each of the functions 
𝑒𝑥, ln 𝑥 , 𝑥𝑦 requires a different architecture. For each function, 
we generate 13 × 9 = 117 different hardware profiles. Results 
are obtained for every hardware profile and for every function. 
B. Generation of input signals 
For 𝑒𝑥 and ln 𝑥, we selected 1000 equally spaced points in 
the allowable input domain listed in Table I for 𝑀 = 5. 
For 𝑥𝑦  testing, we used 150 × 10  linearly spaced (𝑥, 𝑦) 
points, where 𝑥 ∈ [𝑒−𝜃𝑚𝑎𝑥(𝑀=5), 𝑒𝜃𝑚𝑎𝑥(𝑀=5)]  (allowed 𝑥 
interval when |𝑦| = 1). The interval for 𝑦 varies according to 𝑥 
as per the formula |𝑦 ln 𝑥| ≤ 𝜃𝑚𝑎𝑥(𝑀 = 5). 
C. Selection of fixed point formats 
By selecting the integer bit-width (𝐼𝑊) and the fractional bit-
width (𝐹𝑊), a custom fixed point format [𝐵 𝐹𝑊] is defined, 
where 𝐵 = 𝐹𝑊 + 𝐼𝑊 . The range of values is given by 
[−2𝐵−𝐹𝑊−1, 2𝐵−𝐹𝑊−1 − 2−𝐹𝑊] . Table II list the formats we 
selected for our experiments (𝐵 = 24 → 76, in increments of 4) 
along with the maximum value ( 2𝐵−𝐹𝑊−1 − 2−𝐹𝑊 ), the 
resolution (2−𝐹𝑊), and the dynamic range (2𝐵−1) in dB. 
For proper fixed point representation of the input, 
intermediate, and output values of 𝑒𝑥 , ln 𝑥, 𝑥𝑦, a large number 
of bits is required. For the selected input domains of Section 
IV.B, the 𝑒𝑥 and 𝑥𝑦 functions required 20 integer bits, while the 
ln 𝑥  function required 37 integer bits. As for the number of 
fractional bits, we start with 8 bits and then keep increasing it by 
4 up to a maximum of 32 bits. 
For 𝑒𝑥, Fig. 4 plots 𝑥𝑖 for each iteration (𝑖 = −5 → 40) for 
various 𝑥 values. Note that 𝑥𝑖 can be as twice as the final value 
𝑥𝑁 . The largest 𝑒
𝑥=𝜃𝑚𝑎𝑥(𝑀=5)  needs 𝐼𝑊 = 19, thus we need 
𝐼𝑊 = 20  to properly represent the intermediate values. To 
assess the loss in accuracy, we included a format with 𝐼𝑊 < 20. 
For ln 𝑥 , 𝐼𝑊 = 37  bits are required to cover the input 
domain. Thus, we included cases with 𝐵 > 68 bits in Table II. 
The scaling factor provided as a constant input to the 
architecture, depends on 𝑁 and 𝑀 as per (6). The VHDL code 
was synthesized on a Xilinx® Zynq-7000 XC7Z010 SoC (ARM 
processor + FPGA fabric) that runs at 125MHz. 
TABLE II. LIST OF FIXED POINT FORMATS USED IN OUR EXPERIMENTS 
IW: INTEGER WIDTH, FW: FRACTION WIDTH. 𝐵 = 𝐹𝑊 + 𝐼𝑊 
[B FW] IW Maximum value Resolution Dyn. Range 
[24 8] 16 3.277 × 104 3.906 × 10−3 138.5 dB 
[28 8] 20 5.243 × 105 3.906 × 10−3 162.6 dB 
[32 12] 20 5.243 × 105 2.441 × 10−4 186.6 dB 
[36 16] 20 5.243 × 105 1.526 × 10−5 210.7 dB 
[40 20] 20 5.243 × 105 9.536 × 10−7 234.8 dB 
[44 24] 20 5.243 × 105 5.960 × 10−8 258.9 dB 
[48 28] 20 5.243 × 105 3.725 × 10−9 283.0 dB 
[52 32] 20 5.243 × 105 2.328 × 10−10 307.1 dB 
[56 32] 24 8.388 × 106 2.328 × 10−10 331.1 dB 
[60 32] 28 1.342 × 108 2.328 × 10−10 355.2 dB 
[64 32] 32 2.147 × 109 2.328 × 10−10 379.3 dB 
[68 32] 36 3.436 × 1010 2.328 × 10−10 403.4 dB 
[72 32] 40 5.497 × 1011 2.328 × 10−10 427.5 dB 
[76 32] 44 8.796 × 1012 2.328 × 10−10 451.5 dB 
Figure 3. Block Diagram for Powering (𝑥𝑦) computation. The expanded 
hyperbolic CORDIC engine is utilized twice. 
FIXED-POINT
EXPANDED HYPERBOLIC 
CORDIC ENGINE
0 1
M
N
OutLd
<<
1OutReg
0 1 0 1
cordicSel
0
1
cordicDone
F
X
m
u
lt
ip
lie
r
1
+ -
cordic_start
mode
V. RESULTS AND ANALYSIS 
A. Execution Time 
For 𝑒𝑥  and ln 𝑥 , we require one cycle per iteration. The 
output register of each stage requires two extra cycles. For 𝑥𝑦, 
we add the execution times for 𝑒𝑥 and ln 𝑥, and an extra cycle to 
place the final result on the output register. Execution time (in 
number of cycles) depends on 𝑁 and 𝑀, and it is given by: 
 𝐸𝑥𝑒𝑐. 𝑡𝑖𝑚𝑒 (𝑒𝑥, ln 𝑥) = 𝑀 + 1 + 𝑁 + 𝑣(𝑁) + 2 (7) 
𝐸𝑥𝑒𝑐. 𝑡𝑖𝑚𝑒 (𝑥𝑦) = 2(𝑀 + 1) + 2𝑁 + 2 × 𝑣(𝑁) + 5 (8) 
𝑣(𝑁) refers to the number of repeated iterations (see Section 
II). Table III lists execution time (ns) for a clock frequency of 
125 MHz for different values of 𝑁 for the given functions. 
TABLE III. EXECUTION TIME (NS) FOR 𝑒𝑥 , ln 𝑥 , 𝑥𝑦. FREQUENCY: 125 MHZ. 
Function 
N (number of positive iterations), M=5 
8 12 16 20 24 32 36 40 
𝑒𝑥/ ln 𝑥 136 168 208 240 272 336 368 408 
𝑥𝑦 280 344 424 488 552 680 744 824 
B. Resource usage 
Fig. 5 shows resource usage only in terms of 6-input LUTs 
and 1-bit registers for the fourteen bit-widths of Table II and for 
the 𝑒𝑥, ln 𝑥, and 𝑥𝑦 architectures. As the number of bits grow, 
so does the resources. The LUT increase is more pronounced, 
indicating a large combinational cost. Here, we fixed 𝑀 = 5. 
The effect of 𝑁 on resource usage is negligible: 𝑁 only affects 
the size of the LUT for the angles and the state machine. 
C. Accuracy 
For accuracy, we use the peak signal-to-noise ratio:  
𝑃𝑆𝑁𝑅(𝑑𝐵) = 10 × log10(𝑚𝑎𝑥𝑣𝑎𝑙
2 𝑀𝑆𝐸⁄ ), where MSE is the 
mean squared error between the results of our architecture and 
the reference results provided by the MATLAB® built-in 
function in double floating point precision. 𝑚𝑎𝑥𝑣𝑎𝑙 is defined 
as the largest value of the fixed point output format. However, 
for consistency, we use the shortest fixed point format that can 
represent the largest output value for each function (this might 
differ from the one in Table II). 
To validate our selection of fixed point formats, Fig. 6 shows 
accuracy results for 𝑁 = 40 for 𝑒𝑥 and ln 𝑥 in the input domain 
of Section IV.B. Note the very poor accuracy when 𝐼𝑊 < 20 
and 𝐼𝑊 < 37 for 𝑒𝑥 and ln 𝑥 respectively. We can also see the 
effect of the number of fractional bits on accuracy. 
Figs. 7, 8, and 9 plot accuracy as a function of the number of 
positive iterations (𝑁) and the bit-width (𝐵) for 𝑒𝑥, ln 𝑥, and 𝑥𝑦 
respectively. In each case, note how the PSNR values stabilize 
after a certain number of iterations. For 𝑒𝑥 , the case 𝐵 = 24 
yields poor results regardless of the value of 𝑁. For ln 𝑥, the 
cases 𝐵 < 72 yield poor results. For 𝑥𝑦 , we tested with the 
(𝑥, 𝑦)  domain specified in Section IV.B (this is not the full 
allowable domain); here, the cases 𝐵 < 40 yield poor results. 
For 𝑒𝑥 and 𝐵 = 24, 16 integer bits is insufficient to properly 
represent many intermediate and output values, hence the poor 
accuracy results. This is illustrated in Fig. 10, where we plot the 
Figure 5. Resource utilization for 𝑒𝑥, ln 𝑥, 𝑥𝑦 functions. Device: Zynq-7000 
XC7Z010 SoC with 35,200 registers and 17,600 6-input LUTs 
Figure 4. 𝑒𝑥 CORDIC computation: 𝑥𝑖 vs. the iteration number for 𝑥 =
8,10,11,11.8. Note that the intermediate values can be larger than the result 𝑥𝑁. 
Figure 6. Accuracy results vs. the number of integer bits (𝐼𝑊) and fractional 
bits (𝐹𝑊) for the 𝑒𝑥 and ln 𝑥 fixed point architectures. 𝑁 = 40, 𝑀 = 5. 
hardware outputs alongside the MATLAB® built-in function’s. 
For 𝐵 = 24 , notice the point where the hardware starts 
producing incorrect values. For 𝐵 = 28, the plots look identical, 
confirming that 𝐼𝑊 = 20  bits is the minimum required to 
properly represent the intermediate and output values. 
For ln 𝑥 (and thus 𝑥𝑦), note that if 𝐵 < 72, then 𝐼𝑊 < 37. 
This does not properly represent the entire input domain of ln 𝑥 
(Table I for 𝑀 = 5), hence the poor accuracy results. Fig. 11 
illustrates this effect, where we plot our hardware’s results 
versus the MATLAB® built-in function’s. Note that for 𝐵 = 68 
(𝐼𝑊 = 36), the maximum input value allowed by the [68 36] 
format is 3.44 × 1010  (smaller than what Table I allows for 
𝑀 = 5). This is exactly the point at which the ln 𝑥 curve starts 
producing incorrect values. Thus, for 𝐵 <  72, ln 𝑥  can only 
produce correct values if we restrict the input domain to what the 
fixed point format can represent. 
As for 𝑥𝑦 , we detailed some issues when using the full 
allowable convergence domain for 𝑒𝑥 and ln 𝑥, this provides a 
hint on the behavior of 𝑥𝑦. Fig. 12 depicts the 𝑥𝑦 plot for 𝐵 =
28, 40 and the domain: 𝑥 ∈ [𝑒−𝜃𝑚𝑎𝑥(𝑀) 2⁄ , 𝑒𝜃𝑚𝑎𝑥(𝑀) 2⁄ ], |𝑦| ≤ 2, 
this ‘box’ allows for a good depiction of the 𝑥𝑦 surface. Notice 
how for 𝐵 = 28, the 𝑥𝑦 plot differs significantly in some areas 
when compared to the relatively accurate case with 𝐵 = 40. 
D. Multi-objective optimization of the design space for 𝑥𝑦 
Since the execution time depends solely on the 𝑀 and 𝑁, we 
consider it more important to illustrate the trade-offs between 
accuracy and resources. We present the accuracy-resources plot 
for all design parameters for 𝑥𝑦  in Fig. 13. This allows for a 
rapid trade-off evaluation of resources (given in Zynq-7000 
slices) and accuracy for every generated hardware profile. 
Moreover, Fig. 13 also depicts the Pareto-optimal [10] set of 
architectures that we extracted from the design space. This 
allows us to discard, for example, hardware profiles (𝐵 > 52) 
that require more resources for no increment in accuracy. The 
figure also indicates the design parameters that generate each 
Pareto point. There are hardware realizations featuring poor 
accuracy (less than 40 dB) in the Pareto front. For design 
purposes, these points should not be considered. 
Figure 7. Accuracy (PSNR) results for 𝑒𝑥. Accuracy is high for 𝐵 >  24. 
Figure 8. ln 𝑥: Accuracy (PSNR) results. Poor accuracy occurs when we use 
fewer than 37 integer bits (𝐵 <  72) to represent the ln 𝑥 domain (𝑀 = 5). 
Figure 9. Accuracy (PSNR) results for 𝑥𝑦. Accuracy is high for 𝐵 > 24. 
Figure 10. 𝒆𝒙: MATLAB® built-in function vs hardware results. 
Figure 11. ln 𝑥: MATLAB® built-in function vs hardware results. 
This approach allows the user to select only Pareto-optimal 
hardware realizations that simultaneously satisfy resources 
and/or accuracy constraints. For example: i) highest accuracy 
regardless of resource usage: the hardware with format [52 32] 
and 𝑁 = 32 iterations satisfies this requirement at the expense 
of a large resource usage, ii) lowest resource usage subject to 
accuracy  100 dB: the hardware with format [36 16] and 𝑁 =
12  iterations satisfies the constraint and minimizes resource 
usage, iii) lowest resource usage for accuracy  40 dB: the 
hardware with format [28 8] and 𝑁 = 8 meets this constraint, 
iv) highest accuracy for less than 1000 Slices: the hardware with 
format [44 24]  and 𝑁 = 20  meets these constraints, and v) 
Accuracy > 40 dB for no more than 1000 Slices: Three hardware 
profiles satisfy these constraints. We select the one that further 
optimizes a particular need: accuracy or resources. 
VI. CONCLUSIONS 
A fully parameterized fixed point iterative architecture for 
𝑥𝑦  computation was presented and thoroughly validated. The 
expanded CORDIC approach allows for customized improved 
bounds on the domain of 𝑥𝑦. The Pareto-optimal architectures 
extracted from the multi-objective design space allows us to 
solve optimization problems subject to resources and accuracy 
constraints. We also provided a comprehensive assessment of 
how the fixed-point architecture affects the functions.  
Further efforts will focus on the implementation of a family 
of architectures for 𝑥𝑦 , ranging from the iterative version 
presented here to a fully pipelined version. We will also explore 
the use of the scale free hyperbolic CORDIC [5] which requires 
fewer iterations for the same interval of convergence.  
REFERENCES 
[1] J. Piñeiro, M.D. Ercegovac, J.D. Brugera, “Algorithm and Architecture 
for Logarithm, Exponential, and Powering Computation,” IEEE 
Transactions on Computers, vol. 53, no. 9, pp. 1085-1096, Sept. 2004. 
[2] F. De Dinechin, P. Echeverría, M. López-Vallejo, B. Pasca, “Floating-
Point Exponentiation Units for Reconfigurable Computing,” ACM Trans. 
on Reconfigurable Technology and Systems, vol. 6, no. 1, p. 4, May 2013. 
[3] P. Meher, J. Valls, T.-B. Juang, K. Sridharan, K. Maharatna, “50 Years of 
CORDIC: Algorithms, Architectures, and Applications”, IEEE Trans. on 
Circuits and Systems I: Regular Papers, vol. 56, no. 9, Sept. 2009. 
[4] D. Llamocca, C. Agurto, “A Fixed-point implementation of the expanded 
hyperbolic CORDIC algorithm,” Latin American Applied Research, vol. 
37, no. 1, pp. 83-91, Jan. 2007. 
[5] S. Aggarwal, P. Meher, K. Khare, “Scale-free hyperbolic CORDIC 
processor and its application to waveform generation,” IEEE Trans. on 
Circuits and Syst. I: Reg. Papers, vol. 60, no. 2, pp. 314-326, Feb. 2013.  
[6] J. C. Bajard, S. Kla, and J. Muller, “BKM: A new hardware algorithm for 
complex elementary functions,” IEEE Transactions on Computers, vol. 
43, no. 8, pp. 955-963, August 1994. 
[7] V. Kantabutra, “On Hardware for Computing Exponential and 
Trigonometric Function,” IEEE Transactions on Computers, vol. 45, no. 
3, pp. 328-339, March 1996. 
[8] X. Hu, R.G. Harber, S.C. Bass, “Expanding the range of convergence of 
the CORDIC algorithm,” IEEE Transactions on Computers, vol. 40, no. 
1, pp. 13-21, Jan. 1991. 
[9] J. Mack, S. Bellestri, D. Llamocca, “Floating-Point CORDIC-based 
Architecture for Powering Computation,” to appear in Proceedings of the 
10th International Conference on Reconfigurable Computing and FPGAs 
(ReConFig’2015), Mayan Riviera, Mexico, December 2015.  
[10] S. Boyd and L. Vanderberghe, Convex Optimization. Cambridge, U.K.: 
Cambridge Univ. Press, 2004. 
Figure 12. 3-D plot for 𝑥𝑦. 𝑥 ∈ [𝑒−𝜃𝑚𝑎𝑥(𝑀=5) 2⁄ , 𝑒𝜃𝑚𝑎𝑥(𝑀=5) 2⁄ ], |𝑦| ≤ 2 for format [28 8] and [40 20]. Note that inaccuracies when 𝐵 = 28 (format [28 8]) 
Figure 13. Resources-accuracy space and Pareto-optimal front for 𝑥𝑦. 
Device: Zynq-7000 XC7Z010 SoC with 4,400 Slices. 
