Abstract: This paper presents a unified reconfigurable coordinate rotation digital computer (CORDIC) processor for floating-point arithmetic. It can be configured to operate in multiple modes to perform a variety of operations, replacing multiple single-mode CORDIC processors. A reconfigurable pipeline-parallel mixed architecture is proposed to adapt to different operations; it maximizes the sharing of common hardware circuits and improves area-delay efficiency. Compared with previous unified floating-point CORDIC processors, the consumption of hardware resources is greatly reduced. As a proof of concept, we apply the processor to a 16384 × 16384 point-target synthetic aperture radar (SAR) imaging system implemented on a Xilinx XC7VX690T FPGA platform. The maximum relative error of each phase function between hardware and software computation, and the corresponding SAR imaging result, meet the accuracy requirements.
Introduction
The CORDIC algorithm uses a simple shift-and-add iterative procedure to perform several computing tasks. It can execute the rotation of a two-dimensional (2-D) vector in linear, circular and hyperbolic coordinate systems [1]. Owing to the simplicity of its hardware implementation, CORDIC has a wide range of applications in signal processing and image processing, such as QR decomposition [2], singular value decomposition [3], 3-D graphics [4] and robotics [5]. The hardware implementation of these applications requires more than one CORDIC processor operating in different modes and on different trajectories.
Several unified CORDIC processors have been designed and implemented in previous research. However, previous work mainly focuses on fixed-point arithmetic in the circular and hyperbolic coordinate systems, and none of the previous designs includes the linear coordinate system. The design in [6] adopts traditional pipelined iteration and simply adds a selection signal for mode selection. A reconfigurable CORDIC processor based on the scaling-free CORDIC algorithm is demonstrated in [7]; it simplifies the iterative computation at the expense of accuracy. A floating-point CORDIC co-processor is proposed in [8], but the hardware architecture of its pre- and post-processing modules is complicated and consumes considerable resources.
In this paper, we propose a unified reconfigurable floating-point CORDIC processor that can operate in three coordinate systems, in either rotation mode or vectoring mode, to complete a variety of operations. The range of convergence (ROC) of the algorithm is extended by domain conversion, and a software pre-simulation method is used to optimize the data width and the number of iterations. Exploiting the dynamically configurable features of the FPGA, we optimize the iterative module by designing a pipeline-parallel mixed architecture based on the binary-to-bipolar recoding (BBR) technique [9], which maximizes the sharing of common hardware circuits across configurations. The proposed processor achieves high precision, low latency and low hardware complexity.
To further validate the design, we build an FPGA-based prototype and apply it to the calculation of the three phase functions in the chirp-scaling (CS) SAR imaging algorithm [10]. Compared with the software simulation results, the relative error is less than $10^{-3}$. After imaging, according to the quality assessment method described in [11], the result satisfies both the visual and the index requirements.
The rest of the article is organized as follows. In Section 2, we review the CORDIC algorithm. In Section 3, we derive and simulate the ROC of the algorithm and analyze the relationship between the number of iterations and the data width. In Section 4, we propose a unified reconfigurable CORDIC processor. Section 5 discusses the synthesis results of the FPGA implementation and its precision. Section 6 summarizes the main contributions of this work and concludes the paper.
Review of unified CORDIC algorithm
In 1971, Walther reformulated the CORDIC algorithm into a generalized and unified form which is suitable to perform rotations in circular, linear and hyperbolic coordinate systems [1] . It has laid a solid theoretical foundation for designing a unified CORDIC processor architecture. The generalized CORDIC is formulated as follows:
The reconfigurable CORDIC processor proposed in this paper is aimed at the calculation of the phase functions in the chirp-scaling algorithm. Table I shows the elementary functions and operations that can be implemented by CORDIC. Pre- and post-processing steps are necessary to perform these operations.
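For orientation, Walther's generalized iteration is commonly written as x_{i+1} = x_i − m·σ_i·2^{-i}·y_i, y_{i+1} = y_i + σ_i·2^{-i}·x_i, z_{i+1} = z_i − σ_i·α_i, with m = 1, 0, −1 for the circular, linear and hyperbolic systems. The following is a minimal floating-point sketch of that textbook form, not the processor described in this paper. The function names, the shift sequences (circular from i = 0, linear and hyperbolic from i = 1, hyperbolic repeats at i = 4, 13, 40) and the final division by the gain are assumptions chosen for illustration; the indexing used in this paper may differ.

```python
import math

def shift_sequence(m, n):
    """Shift indices for n micro-rotations (assumed textbook sequences)."""
    if m == 1:                      # circular: i = 0, 1, 2, ...
        return list(range(n))
    if m == 0:                      # linear: i = 1, 2, 3, ...
        return list(range(1, n + 1))
    seq, i = [], 1                  # hyperbolic: repeat i = 4, 13, 40 for convergence
    while len(seq) < n:
        seq.append(i)
        if i in (4, 13, 40) and len(seq) < n:
            seq.append(i)
        i += 1
    return seq

def cordic(x, y, z, m, vectoring, n=24):
    """Generalized CORDIC; m = 1 circular, 0 linear, -1 hyperbolic.

    Rotation mode drives z to 0, vectoring mode drives y to 0 (x > 0 assumed).
    The returned x and y are already divided by the accumulated gain.
    """
    gain = 1.0
    for i in shift_sequence(m, n):
        t = 2.0 ** -i
        alpha = math.atan(t) if m == 1 else t if m == 0 else math.atanh(t)
        if vectoring:
            sigma = -1.0 if y >= 0 else 1.0
        else:
            sigma = 1.0 if z >= 0 else -1.0
        x, y, z = x - m * sigma * t * y, y + sigma * t * x, z - sigma * alpha
        gain *= math.sqrt(1.0 + m * t * t)
    return x / gain, y / gain, z

cos_v, sin_v, _ = cordic(1.0, 0.0, 0.6, 1, False)      # sine and cosine
_, product, _ = cordic(0.75, 0.0, 0.5, 0, False)       # 0.75 * 0.5
_, _, quotient = cordic(4.0, 1.0, 0.0, 0, True)        # 1 / 4
root = cordic(3.0, 1.0, 0.0, -1, True)[0] / 2.0        # sqrt(2) from X0 = 2 + 1, Y0 = 2 - 1
```

The same loop body serves all five operations; only m, the mode and the initial values change, which is the property a unified processor exploits.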
Software Pre-analysis of ROC and Iterative Numbers
Analysis of ROC
The ROC is an important aspect in the design of a CORDIC processor, since it determines the range of inputs over which the algorithm is valid. We analyze the ROC by theoretical derivation and verify it by MATLAB simulation. The ROC of the five types of floating-point operation is shown in Figure 1.
• Multiplication: According to the iterative formula of the Z-path, after N iterations the following can be derived:
Hence, the initial value needs to satisfy $|Z_1| < 1$ to ensure that $Z_{N+1}$ converges to 0 after N iterations.

• Division: According to the iterative formula of the Y-path, after N iterations the following can be derived:
Hence, the initial value needs to satisfy $|Y_1 / X_1| < 1$ to ensure that … for the arctangent function; these ranges cannot cover the entire cycle. Therefore, it is necessary to expand the range using mathematical relationships.
• Square root: In this paper, we discuss two transformation solutions for the input coordinates of the initial vector:
where X represents the operand. Based on software simulation, the relative error graphs of the five types of operation are shown in Figure 1. The relative error $E_{relative}$ can be represented as:
$$E_{relative} = \left| \frac{R_{results} - R_{arithmetic}}{R_{arithmetic}} \right| \tag{5}$$
where $R_{results}$ is the software (MATLAB) simulation result of the CORDIC algorithm and $R_{arithmetic}$ is the reference result provided by the MATLAB built-in arithmetic function in single-precision floating point.
For square root, the ROC of the first and second solutions is (0.12, 9.36) and (0.03, 2.34), respectively. In fixed-point arithmetic, the first solution operates only on the integer part of the data, so it produces fewer carries and simplifies the operation. In addition, its ROC is wider than that of the second solution. Finally, in the hardware implementation stage, the convergence domain can be expanded by changing the position of the radix point of the fixed-point numbers.
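The reported square-root bounds can be cross-checked with a short calculation. The sketch below assumes the standard convergence condition for hyperbolic vectoring, |atanh(Y0/X0)| ≤ Σ atanh(2^{-i}) with iterations 4 and 13 repeated, and assumes initial vectors of the form X0 = x + c, Y0 = x − c. The value c = 1 matches the "plus 1, minus 1" initialisation described later in the paper; c = 1/4 for the second solution is only an inference from the stated ROC, not something the text confirms. All function names are hypothetical.

```python
import math

def hyperbolic_theta_max(n=24, repeats=(4, 13, 40)):
    """Largest angle reachable by n hyperbolic micro-rotations (with repeats)."""
    total, i, used = 0.0, 1, 0
    while used < n:
        total += math.atanh(2.0 ** -i)
        used += 1
        if i in repeats and used < n:   # repeated iteration
            total += math.atanh(2.0 ** -i)
            used += 1
        i += 1
    return total

def sqrt_roc(c, n=24):
    """Theoretical ROC of sqrt(x) for X0 = x + c, Y0 = x - c (c is an assumption)."""
    r = math.tanh(hyperbolic_theta_max(n))   # largest |Y0 / X0| that still converges
    return c * (1 - r) / (1 + r), c * (1 + r) / (1 - r)

lo1, hi1 = sqrt_roc(1.0)    # first solution
lo2, hi2 = sqrt_roc(0.25)   # assumed form of the second solution
```

Under these assumptions the upper bounds come out close to 9.36 and 2.34. The theoretical lower bounds are about 0.107 and 0.027, slightly below the simulated 0.12 and 0.03, which would be consistent with the simulation applying an accuracy threshold rather than a pure convergence test.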
Analysis of optimum iterative numbers and data width
A trade-off between hardware cost, latency and numerical accuracy is needed, depending on the application. The accuracy of the CORDIC algorithm is related to the data bit width b and the number of iterations N. The total quantization error consists of two parts: approximation error and rounding error.
According to [13], the approximation error $E_A$ and the rounding error $E_R$ can be described as:
Thus, the total quantization error $E_{total}$ is:
Based on the above theoretical derivation, we use a simulation-based method to determine the optimal number of iterations and data width, as shown in Figure 2. To satisfy the accuracy requirement, the number of iterations is set to 24 and the data width b is set to 23; the word length of the fixed-point numbers in the rotation unit is therefore 25 bits, comprising 1 sign bit, 1 integer bit and 23 fractional bits.
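A simulation-based sweep of this kind can be sketched as follows. The code is an illustrative bit-level model of circular rotation-mode CORDIC on integers with b fractional bits, not the MATLAB model used in this work; the function names, the angle grid and the pre-scaled initial value are assumptions. Increasing N shrinks the approximation error, while increasing b shrinks the rounding error introduced by each truncating shift.

```python
import math

def fixed_point_sincos(theta, n, b):
    """Circular rotation-mode CORDIC on integers with b fractional bits."""
    one = 1 << b
    gain = 1.0
    for i in range(n):
        gain *= math.sqrt(1.0 + 4.0 ** -i)
    x, y = round(one / gain), 0            # pre-scaled start value, so no final multiply
    z = round(theta * one)
    for i in range(n):
        alpha = round(math.atan(2.0 ** -i) * one)   # quantized angle constant
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - alpha
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + alpha
    return x / one, y / one

def max_error(n, b, samples=201):
    """Worst absolute sine/cosine error over a grid of angles in [-1, 1] rad."""
    worst = 0.0
    for k in range(samples):
        theta = -1.0 + 2.0 * k / (samples - 1)
        c, s = fixed_point_sincos(theta, n, b)
        worst = max(worst, abs(c - math.cos(theta)), abs(s - math.sin(theta)))
    return worst

for n, b in [(8, 23), (16, 23), (24, 12), (24, 23)]:
    print(f"N={n:2d} b={b:2d} max error={max_error(n, b):.2e}")
```

Sweeping both parameters in this way shows where adding iterations stops helping because rounding error dominates, which is the kind of evidence a choice such as N = 24, b = 23 can be based on.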
Preprints (www.preprints.org) | NOT PEER-REVIEWED |

Proposed reconfigurable CORDIC processor
The proposed reconfigurable floating-point CORDIC processor design is shown in Figure 3. By adding selectors, the circuits common to different modes are reused as much as possible. The 2-bit signal T1&T0 takes the value 00, 01 or 10 to select the circular, linear or hyperbolic coordinate system, respectively, and the 1-bit signal P takes the value 0 or 1 to select rotation or vectoring mode, respectively.
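The configuration encoding above can be written down as a small lookup. This is only a software illustration of the signal values stated in the text; the enum and function names are hypothetical, and treating T1&T0 = 11 as an error is an assumption, since the text does not say how that code is handled.

```python
from enum import Enum

class Coord(Enum):
    CIRCULAR = 0b00
    LINEAR = 0b01
    HYPERBOLIC = 0b10

class Mode(Enum):
    ROTATION = 0
    VECTORING = 1

def decode_config(t1, t0, p):
    """Decode the T1&T0 and P signals into (coordinate system, mode)."""
    code = (t1 << 1) | t0
    if code == 0b11:
        # Assumption: the text assigns no coordinate system to this code.
        raise ValueError("T1&T0 = 11 is not assigned to a coordinate system")
    return Coord(code), Mode(p)

# Example: T1&T0 = 01, P = 1 selects linear vectoring, i.e. division.
selected = decode_config(0, 1, 1)
```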
Pre-processing module
This module mainly completes the conversion from floating-point to fixed-point numbers and expands the ROC. In this paper, we adopt the IEEE-754 standard single precision floating-point data format [14] . The input data can be represented as:
where Bias = 127. The data consist of a 1-bit sign (S), an 8-bit exponent (E) and a 23-bit mantissa (M), i.e. the fractional part. Figure 4 shows the hardware structure of the pre-processing module. Depending on the operation selected by the user, the appropriate data path is chosen by the signal T1&T0&P. The outputs of the pre-processing module, i.e. X0, Y0 and Z0, for the different data paths are shown in Table 2.
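For a normalized IEEE-754 single-precision number, the value is (−1)^S × 2^(E − Bias) × (1 + M / 2^23). The field split can be illustrated in software with the sketch below; the function names are hypothetical, and zeros, subnormals, infinities and NaNs are deliberately ignored.

```python
import struct

def decode_float32(value):
    """Split a number, stored as IEEE-754 single precision, into (S, E, M)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

def rebuild(sign, exponent, mantissa, bias=127):
    """Recompute the value of a normalized number from its three fields."""
    return (-1.0) ** sign * 2.0 ** (exponent - bias) * (1.0 + mantissa / 2 ** 23)

fields = decode_float32(-6.5)      # -6.5 = -1.625 * 2**2
value = rebuild(*fields)
```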
• Trigonometric function operation
The mantissa is shifted according to the result of E minus 127 and converted into a fixed-point number with 1 sign bit, 1 integer bit and 23 fractional bits. In this module, the fixed-point numbers are denoted DX, DY and DZ. We then use a mathematical transformation to map data from the entire circumference into the range that can be covered by the CORDIC algorithm. We divide the entire circumference into five intervals and encode them. Through the mapping relations in Table 3, an input phase or vector in interval B, C, D or E is transformed into interval A. NX, NY and NZ are the output values after this operation.
The mantissa is expressed as a fixed-point number with 1 sign bit, 1 integer bit and 23 fractional bits and serves as the input of the CORDIC rotation unit. For multiplication, EY = EX + EZ − 127. For division, EZ = EY − EX + 127; to ensure |Y/X| < 1, we add a right shifter in the Y data path.
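The exponent bookkeeping can be illustrated with a small software model. In this sketch the mantissa product and quotient are computed with ordinary floating-point operators as a stand-in for the CORDIC core. The one-bit right shift of the Y mantissa and the matching exponent correction of +1 are an assumed reading of the "right shifter" mentioned above. Inputs are assumed positive and normalized, and the helper names are hypothetical.

```python
import math

def split(value):
    """Return (mantissa in [1, 2), biased exponent) of a positive normal number."""
    m, e = math.frexp(value)          # value = m * 2**e with m in [0.5, 1)
    return m * 2.0, e - 1 + 127

def multiply(x, z):
    mx, ex = split(x)
    mz, ez = split(z)
    ey = ex + ez - 127                # EY = EX + EZ - 127
    return (mx * mz) * 2.0 ** (ey - 127)      # mx * mz stands in for the CORDIC core

def divide(y, x):
    my, ey = split(y)
    mx, ex = split(x)
    ez = ey - ex + 127                # EZ = EY - EX + 127
    my_shifted = my / 2.0             # right shift by one bit keeps |Y/X| < 1
    ratio = my_shifted / mx           # stands in for the CORDIC core
    return ratio * 2.0 ** (ez - 127 + 1)      # +1 undoes the right shift

product = multiply(3.0, 5.0)
quotient = divide(7.0, 2.0)
```

Because both mantissas lie in [1, 2), their ratio lies in (0.5, 2); halving the numerator brings it into (0.25, 1), inside the division ROC.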
• Square-root operation
First, the parity of the exponent is determined and the exponent is divided by 2 directly; according to the exponent, the mantissa is expressed in the corresponding fixed-point format. The initial values of the X- and Y-paths are MX plus 1 and MY minus 1, respectively, where MX equals MY.
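One way to read this step is sketched below. The code assumes that an odd exponent is made even by doubling the mantissa, so the operand lies in [1, 4), inside the ROC of the first solution; the text does not spell out this detail. The hyperbolic vectoring core is replaced by a direct evaluation of sqrt(X0² − Y0²), which equals 2·sqrt(M) when X0 = M + 1 and Y0 = M − 1. The function names are hypothetical and positive inputs are assumed.

```python
import math

def sqrt_preprocess(value):
    """Return (X0, Y0, halved exponent) for a positive input."""
    m, e = math.frexp(value)          # value = m * 2**e with m in [0.5, 1)
    m, e = m * 2.0, e - 1             # mantissa in [1, 2), unbiased exponent
    if e % 2:                         # odd exponent: fold a factor of 2 into the mantissa
        m, e = m * 2.0, e - 1
    return m + 1.0, m - 1.0, e // 2

def sqrt_via_identity(value):
    x0, y0, half_e = sqrt_preprocess(value)
    # Stand-in for hyperbolic vectoring CORDIC followed by gain compensation.
    root_m = math.sqrt(x0 * x0 - y0 * y0) / 2.0
    return root_m * 2.0 ** half_e

result = sqrt_via_identity(2.0)
```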
Design of reconfigurable CORDIC rotation module
Rotation unit A
Rotation unit A implements the pipelined iterative structure; the single-layer iteration structure is shown in Figure 5. It requires two 1-bit shifters, three multiplexers and three 25-bit add/sub units. The rotation directions selx, sely and selz are defined by MUX3. Table 4 shows the selection scheme of the different multiplexers.
Rotation unit B
Rotation unit B implements the parallel structure. Based on the binary-to-bipolar recoding (BBR) technique, in rotation mode the rotation direction of each micro-rotation $i$ can be predicted in parallel from the binary value of the initial input angle. The parallel processing method unfolds the micro-rotations directly, so the rotations of the X- and Y-paths can be executed concurrently. The iterations from the (N/2+1)th to the (N+1)th can be simplified as follows:
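As a side note on the recoding itself (this is not the simplified iteration referred to above), the idea behind BBR can be checked with exact arithmetic. Each binary digit b_i in {0, 1} is mapped to a bipolar digit r_i = 2·b_i − 1 in {−1, +1}, and for an n-bit fraction the identity Σ b_i·2^{-i} = (1/2 − 2^{-(n+1)}) + Σ r_i·2^{-(i+1)} holds. The sketch below verifies this standard identity; the formulation used in [9] may differ in detail and the function names are hypothetical.

```python
from fractions import Fraction

def bbr_recode(bits):
    """Map binary digits b_i in {0, 1} to bipolar digits r_i in {-1, +1}."""
    return [2 * b - 1 for b in bits]

def binary_value(bits):
    """Value of the fraction 0.b1 b2 ... bn."""
    return sum(Fraction(b, 2 ** i) for i, b in enumerate(bits, start=1))

def bipolar_value(bits):
    """Same value rebuilt from the bipolar digits plus a constant offset."""
    n = len(bits)
    offset = Fraction(1, 2) - Fraction(1, 2 ** (n + 1))
    digits = bbr_recode(bits)
    return offset + sum(Fraction(r, 2 ** (i + 1)) for i, r in enumerate(digits, start=1))

example = [1, 0, 1, 1, 0, 0, 1, 0]
directions = bbr_recode(example)      # all signs are available at once
```

Since every r_i is ±1 and depends only on the corresponding bit, all rotation directions are known simultaneously, which is what allows the later micro-rotations to be evaluated in parallel rather than one after another.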
On the basis of formula (9), we merge micro-rotation stages 13 to 24. Because too much combinational logic in one stage would lower the overall clock frequency, we adopt a tree-structured accumulation circuit in the design. Compared with the traditional CORDIC structure, it saves 8 clock cycles (from 12 to 4) under low routing stress. The architecture of reconfigurable rotation unit B is shown in Figure 6.
According to Table 5, the proper data are selected in the three paths. Then, according to the location of the leading '1' of the fixed-point data (excluding the sign bit), the exponent is normalized and the mantissa is determined. Finally, they are concatenated so that the fixed-point number is converted into a floating-point number. The final outputs are denoted X_result, Y_result and Z_result, respectively. The hardware structure is shown in Figure 7.
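The fixed-to-floating conversion can be modelled in a few lines. The sketch assumes a signed integer holding 23 fractional bits and omits the exponent offset carried over from pre-processing, which the hardware would add before packing; rounding is by truncation. The function names are hypothetical.

```python
import struct

def fixed_to_float_fields(word, frac_bits=23):
    """Convert a signed fixed-point integer into IEEE-754 fields (S, E, M)."""
    sign = 1 if word < 0 else 0
    mag = abs(word)
    if mag == 0:
        return sign, 0, 0
    lead = mag.bit_length() - 1                  # position of the leading '1'
    exponent = lead - frac_bits + 127            # normalized, biased exponent
    if lead >= frac_bits:
        mantissa = (mag >> (lead - frac_bits)) & 0x7FFFFF
    else:
        mantissa = (mag << (frac_bits - lead)) & 0x7FFFFF
    return sign, exponent, mantissa

def pack_float32(sign, exponent, mantissa):
    """Splice the three fields together and reinterpret them as a float."""
    bits = (sign << 31) | (exponent << 23) | mantissa
    return struct.unpack(">f", struct.pack(">I", bits))[0]

word = round(0.75 * 2 ** 23)                     # 0.75 in the assumed fixed-point format
fields = fixed_to_float_fields(word)
value = pack_float32(*fields)
```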
Proposed unified reconfigurable CORDIC architecture
In the rotation module, exploiting the dynamically configurable features of the FPGA, we propose a pipeline-parallel mixed architecture. The 1st to 12th iterative units are the same in all modes and are implemented as a pipeline of rotation unit A. The differences lie mainly in the 13th to 24th units, as shown in Table 6. Module II compensates for the scaling factor K in the square-root operation. The unified reconfigurable architecture is shown in Figure 8.
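The effect of mixing iterative and merged stages can be illustrated for the sine/cosine case. The sketch below runs 12 conventional micro-rotations and then replaces the remaining 12 with a single multiply-add by the residual angle, relying on atan(2^{-i}) ≈ 2^{-i} for i ≥ 12. This is a simplified floating-point stand-in under stated assumptions, not the paper's formula (9). In hardware the residual would be applied through recoded direction bits and an adder tree rather than a multiplier, and the circular shift sequence starting at i = 0 is an assumption.

```python
import math

def hybrid_sincos(theta, n=24):
    """12 iterative stages followed by one merged stage (illustrative model)."""
    half = n // 2
    gain = 1.0
    for i in range(half):                 # only the iterative stages contribute gain here
        gain *= math.sqrt(1.0 + 4.0 ** -i)
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(half):                 # stages 1..12: one micro-rotation each
        sigma = 1.0 if z >= 0 else -1.0
        t = 2.0 ** -i
        x, y, z = x - sigma * y * t, y + sigma * x * t, z - sigma * math.atan(t)
    # Stages 13..24 merged: the residual |z| is at most atan(2**-11), so a
    # first-order rotation by z is accurate to roughly z**2 / 2.
    return x - y * z, y + x * z

cos_v, sin_v = hybrid_sincos(0.7)
```

The merged step has an error of second order in the residual angle, about 1.2e-7 in the worst case, which is comparable to the resolution of a 23-bit fraction; this is why halving the number of sequential stages need not cost accuracy.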
Synthesis results
The reconfigurable CORDIC processor is coded in VHDL and synthesized using the Xilinx ISE 14.7 development tool. For a fair comparison with other works, we choose the Virtex-5 XC5VFX130T FPGA as the implementation platform. The input data of the processor are single-precision floating-point numbers, and the X, Y and Z paths in the CORDIC rotation unit use 25-bit fixed-point numbers. Table 7 shows the FPGA resource occupation and a comparison with several previous works. We use the average relative error (ARE) in different modes for accuracy analysis; the relative error is given by formula (5), where $R_{results}$ is the hardware simulation result of our processor. For the same ARE, the LUT and register consumption is lower than that of the related designs described in [15], [17] and [8]. Compared with [16], the proposed design achieves a higher frequency.
To compare with unified fixed-point processors, we synthesize the rotation unit module separately, as shown in Table 8. When the data format is 16-bit fixed point and the number of iterations is 17, the total consumption of LUTs and registers is similar to that of [6] and [7]. However, [6] and [7] can only operate in the circular and hyperbolic coordinate systems. In addition to these two systems, our design also integrates the linear coordinate system, which increases the place-and-route stress; as a result, the maximum working frequency is lower than that of [6], but it is higher than that of [7] owing to a simpler hardware architecture. When the data format is 25-bit fixed point and the number of iterations is 24, the resource consumption is slightly larger, but higher precision is achieved. Therefore, the proposed processor offers a good compromise among resources, accuracy and speed.
Precision analysis
To verify the precision of our design, we apply the proposed processor to a SAR imaging system. The test scenario is a 16384 × 16384 point-target scene, and the SAR imaging system is implemented on a Virtex-7 XC7VX690T FPGA. Due to the large data granularity, we adopt region-constant phase multipliers [18]: the phase functions in the chirp-scaling imaging algorithm are decimated by 8:1, so each phase function has 2048 values. We compare the hardware simulation results $R_{results}$ of the three phase functions with the corresponding MATLAB simulation results $R_{arithmetic}$ and, on the basis of formula (5), calculate the maximum relative error for each factor line, as shown in Figure 9. The values lie between $10^{-5}$ and $10^{-3}$, which is acceptable for high-resolution imaging. In general, the precision of the phase functions is the main influence on imaging quality. Thus, we use the image quality assessment methodology described in [11], together with the complete SAR imaging procedure, to evaluate the computed phase functions. The imaging results of MATLAB and FPGA are shown in Figure 10 (a) and (b), respectively, and Table 9 shows the result of the point-target imaging quality assessment. The peak side-lobe ratio (PSLR), integrated side-lobe ratio (ISLR) and resolution (RES) are commonly adopted to evaluate imaging quality [11]. The imaging quality is good: the approximately 2 dB degradation of PSLR and ISLR meets the engineering requirements.
Figure 10. Point target imaging results: (a) MATLAB imaging; (b) FPGA imaging.
Conclusions
In this paper, we design a unified reconfigurable floating-point CORDIC processor that can operate in different modes and perform a variety of floating-point operations. It replaces multiple single-mode processors and reduces the number of processors needed. In addition, the ROC is extended and multiple operations are integrated. A reconfigurable pipeline-parallel mixed architecture is proposed in the rotation module to adapt to different floating-point operations. Compared with traditional CORDIC processors, it greatly reduces hardware resources and achieves high accuracy with an acceptable maximum working frequency. Furthermore, the relative error of each phase function line between hardware and software computation is acceptable, and the corresponding SAR imaging result meets the quality requirements.
