Abstract-The computation of the arctangent of a complex number, i.e., the atan2 function, is frequently needed in hardware systems that could profit from an optimized operator. In this brief, we present a novel method to compute the atan2 function and a hardware architecture for its implementation. The method is based on a first stage that performs a coarse approximation of the atan2 function and a second stage that improves the output accuracy by means of a lookup table. We present results for fixed-point implementations in a field-programmable gate array device, all of them guaranteeing last-bit accuracy, which provide an advantage in latency, speed, and use of resources, when compared with well-established fixed-point options.
I. INTRODUCTION
The computation of the arctangent function atan2(a, b) (see Fig. 1 ), i.e., obtaining the angle of a complex number c = b + ja, has been the subject of extensive study, because this computation is required in many applications.
In hardware approximations for atan2 (a, b) , there is often a tradeoff between the use of resources and the computation speed and/or latency. For example, the fastest option for the computation of any function may always be the direct implementation with a lookup table (LUT), but, since the atan2(a, b) is a function of two input variables, in such a case, if the precision of the input data increases by one bit, the amount of memory needed increases by a factor of four. On the other hand, iterative algorithms, such as the COordinated Rotation DIgital Computer (CORDIC), can be implemented with a minimal use of resources [1] , but at the cost of a low processing speed. If more parallelism is introduced in the implementation, much higher throughputs can be achieved, but the use of resources and the computational delay are increased.
Smaller LUTs can be achieved by using the recip-mult-atan method (RMAM) [2] , [3] : first, z = a/b is calculated by computing 1/b with an LUT (avoiding thus the large LUT needed for a two-variable function) and multiplying the result by a, and finally, the one-variable function atan(z) is computed using another LUT.
Another option is to use high-order algebraic polynomials, like Chebyshev polynomials or Taylor series [1] . These methods offer good precision, but, since the arctangent is highly nonlinear, they lead to long polynomials and intensive computations. In other cases, approximations based on rational functions are used [4] - [6] , as they may provide good enough results with a few elementary operations. As a general rule, in this kind of approximations, the division operation is the main contributor to their computational cost, but in addition to that division, they usually require one or more multiplication operations. The architecture we propose is essentially a two-stage method, as shown in Fig. 2 . The First Stage uses a low-complexity coarse approximation for the two-input atan2(a, b)/2π. The Second Stage improves the accuracy by means of a small LUT that stores precomputed error values, as a function of the output of the First Stage. This table does not depend on the two inputs a and b of the atan2 operator and is, therefore, comparatively much smaller. As will be shown, the resulting operator is small and can compute the arctangent faster than other popular options, for the same output accuracy.
The organization of this brief is as follows. In Section II, we present the algorithm used for the atan2 approximation. Section III details the error analysis. The proposed hardware architecture and relevant implementation details are discussed in Section IV. In Section V, we present implementation results in a field-programmable gate array (FPGA), and we compare our results with those from other atan2 operators that use the same normalization.
II. APPROXIMATION FOR THE ATAN2 FUNCTION

A. First Stage: Coarse Approximation of Atan2(a, b)/2π
The proposed approximation of atan2(a, b)/2π of a complex number c = b + ja = |c|e j θ is performed in two stages. The First Stage computes a coarse approximation for atan2(a, b)/2π using a first-order Lagrange interpolation (see [7] ). In the range [−π/4, π/4], this approximation is
Using basic trigonometrical identities (see [8] ), this approximation can be extended to the full [0, 2π) range 
According to (2)- (5), atan2(a, b)/2π is approximated as (offset + f r )/4, where offset is 0, 1, 2, or 3 and f r is either a/2b or −b/2a, depending of the angle range. In Table I , we summarize the values of offset and f r for the four different angle ranges used in (2)-(5). As it is made explicit in Table I , the angle range can be identified from the signs of a + b and a − b. This coarse approximation is shown for the full [0, 2π) angle range in Fig. 3 .
B. Second Stage: Error Reduction Step for the First Stage
The First Stage described in Section II-A approximates atan2(a, b), normalized to the range [0, 1), as follows:
The error in that approximation is shown in Fig. 4 (a) as a function of the angle of the complex number c = b + ja. If this error function were stored in an LUT, it would require a large amount of storage resources, since atan2(a, b) is a two-variable function, and the LUT would have to be addressed by both the variables. However, this error can be transformed into a new error function that only depends on f r , and the coarse approximation calculated in the First Stage. This can be easily seen if, without loss of generality, we express this error for the first octant Therefore, in our proposal, we add a Second Stage that improves the accuracy of the First Stage using a tabulated version of | s1 ( f r )| addressed by | f r |. The contents of that LUT are shown in Fig. 4(b) . Since this error is periodic, only 1/8 of the values need to be stored. The schematic of the whole system is shown in Fig. 5 , which is analogous to the Kmetz/Maenner method of improving Mitchell's method [9] .
III. ERROR ANALYSIS
Since implementation results strongly depend on the target accuracy, to ensure a fair comparison of our results with those from other authors, all the implementation results that we give in this brief guarantee the same accuracy objective: for the computation of atan2(a, b)/2π, where a and b are signed values represented with w bits, the output has also w bits with last-bit accuracy (LBA): the absolute value of the difference between the output and the atan2 computed with infinite accuracy is lower than the weight of the least significant bit (LSB) of the output. Since we normalize the output to the [0, 1) range, corresponding to [0, 2π) rad, LBA means that the absolute value of the error is lower than 2 −w . Fig. 6(b) shows the scheme used to analyze the error. According to this figure, the total error is
where α is the actual output (with errors) of the operator, h() is the ideal function performed by the blocks inside of the dashed box of Fig. 6(b) , and h and c are the contributions to the total error from the operators in h() and from the errors at the input of h(), respectively
h ( f r ) can be approximated around f r , in order to obtain a bound
where f ≡ f r_a − f r . From (11), we get c ≈h ( f r ) f . We obtained empirically that 0.32 is the maximum value for |h ( f r )|, which we round to 1/3 to account for higher order error terms, so
On the other hand, the error in m, the output of the multiplier, can be expressed as
where mult ≡ m − ad and recip ≡ d − 1/b are the errors introduced by the multiplier and the reciprocal table, respectively. Since |a|<1 And the total error is bounded by
The output α is rounded to w bits, adding an error whose absolute value could be as big as 2 −(w+1) . Since LBA means that the total error is lower than 2 −w , in the worst case it should be satisfied
IV. HARDWARE ARCHITECTURE Fig. 6(a) shows the scheme of the proposed hardware architecture. It works with the absolute values of the inputs a and b, and their signs are used later in the scheme. The "sel" signal selects the operands for the division operation according to the signs of a + b and a − b, as detailed in Table I . The division required for the computation of | f r | is implemented with a table that stores 1/x and a multiplier that completes the division. A scaling stage is added in order to reduce the size of the reciprocal table. The most relevant implementation details are commented upon in Section IV-A-E. Specifically, we give details for three accuracies: w = {8, 12, 16} bit. Fig. 6(a) shows the binary formats used in different buses of the system, where s(q, t) and u(q, t) denote signed and unsigned fixedpoint formats, respectively. 2 q is the weight of the most significant bit (MSB) and 2 t is the weight of the LSB.
A. Datapath Dimensioning
If we consider a simple case where the only errors present in both LUTs are from the rounding of the stored values, the worst case errors would be recip ≤ 2 −(w+2) /6, h = 2 −(w+4) , and mult = 0. Note that truncating the output of the multiplier does not introduce an additional error unless the truncated bits are used in the following processing steps. It can be easily checked that under these simplified conditions, (16) is satisfied even for the upper bound defined by (15). In a more realistic scenario, LBA can still be achieved with smaller LUTs that either do not use all the available bits as their address word or that are implemented as bipartite tables [10] . Although in these cases the tables introduce bigger errors (as explained in the following), LBA could still be achieved, since (15) and (16) represent an upper bound that could not be reached. For this reason, we performed exhaustive tests for different LUT sizes looking for optimized implementations. Fig. 7 is an example of the error pattern obtained in one of the w = 12 operators.
B. Range Reduction
Obtaining | f r | requires the computation of y/x [see Fig. 6(a) ], which involves the computation using an LUT of 1/x. Since 1/x can be extremely large for small values of x, a scaling operation is performed on |a| and |b|. This block detects the leading-zeros in both |a| and |b|, scaling both by the same factor (2 s ) so the MSB of x, the biggest one of both outputs, is always 1. Details of this block are found in [2] and [3] . Thanks to this block, x is always in the [0.5, 1) range and the biggest possible value of 1/x is 2.
C. Computation of the Reciprocal
For the computation of 1/x, two different strategies, both tablebased, are used: direct tabulation for the w = 8 bit operator and bipartite tables for w = {12, 16} bits. In both cases, all the bits from the input word are used. Therefore, for the direct tabulation the only errors are those created by rounding the words stored in the table, and for the bipartite tables, the maximum absolute value of the error can be estimated from the second derivative of the stored function and also from the rounding errors [11] . The value stored in the first address of this table should be 2, but 2 − 2 −n+1 is stored for a table with n-bit words, so the MSB of all the stored words is the same, and therefore, it does not need to be stored.
The size of this table is 64 × 6 for w = 8, 128 × 14 + 128 × 7 for w = 12, and 1k × 18 + 512 × 8 for w = 16 bit.
D. Error LUT
The error LUT stores the values for | s1 ( f r )|. Since only the absolute value is stored, the proper sign is applied later taking advantage of an adder-subtractor. As is the case for the 1/x table, two different strategies are used: direct tabulation for the w = {8, 12} bit operators and bipartite tables for w = 16. Irrespectively of the option selected, not all the available input bits are used to address the table, and this fact introduces an additional error term related to the first derivative of the stored function [12] . Since the address input of the table is created truncating the input word, this table is always filled using values halfway in the segments [12] .
The size of this table is 64 × 5 for w = 8, 256 × 9 for w = 12, and 512 × 13 + 512 × 7 for w = 16. 
E. Multiplier
The size of this multiplier is 6 × 7, 15 × 11, and 19 × 15 for w = {8, 12, 16} bits, respectively. Although the natural implementation option for w = {12, 16} bit operators is to use a DSP48 block, for comparison purposes, we implemented logic-only versions of the required multipliers. Those multipliers were truncated by computing only the MSBs of the product and an offset value was added to round the output and compensate for the truncation. We did not explore the possibility of using an internal pipeline in those multipliers.
F. Final Rounding
To achieve LBA, i.e., an absolute error smaller than 2 −w for the selected output format, the output value has necessarily to be rounded. In the worst case, the absolute value of the rounding error is 2 −w−1 , and this error added to the combination of all the other error terms should be lower than 2 −w to guarantee our accuracy goal. In order to avoid the use of a final rounding step, half LSB (2 −w−1 ) is added to the offset value [see Fig. 6(a) ] and the output of the atan2 operator is simply truncated to w fractional bits.
V. IMPLEMENTATION RESULTS
Our proposal was modeled using VHDL language, and was implemented for w = {8, 12, 16} bits in a Kintex7 xc7k325tffg900-2 FPGA device from Xilinx. Table II summarizes the implementation results. As can be seen, we report results for the case when only inputs and output are registered (1 cycle) and also for several degrees of pipelining. It should be noted that when embedded dual-port RAMs, called block-select RAMs (BSRAMs), are used in the designs, the lowest latency that can be achieved by the operator depends on the amount of BSRAMs, since they are synchronous memories that add at least one latency cycle. We also report results for different implementation options for the required tables and multipliers. When those operators are not implemented with embedded BSRAM/DSP48 blocks, they are implemented with the LUTs available in the target device. Abbreviations used in Tables II-IV Next, we analyze the best implementation results from other authors. Gutierrez and Valls [2] and Gutierrez et al. [13] reported implementation results for LUT-based approximations with 12-bit inputs, but their results are not comparable with ours, since their achieved accuracy is only around ten bits. de Dinechin and Istoan [3] have recently reported the implementation results for different lastbit-accurate fixed-point atan2(a, b) options, including the method proposed in [2] . Their results suggest that CORDIC should be the main reference for performance comparisons.
For comparison purposes, we implemented in the same device an atan2 operator based on an unrolled CORDIC architecture designed to guarantee LBA. It was optimized by using the minimum amount of guard bits, by stopping updating the x i path in the last stages, and also by progressively cutting down the width of the y i path (see [3] ). The initial z i value used was set to half output LSB (2 −w−1 for the same normalization criterion as in our proposal) with the purpose that a truncation of the output performs a rounding operation. Table III shows the implementation results. When comparing the computational speed from our operator (see Table II ) with that of CORDIC's, as a general conclusion, our proposal can run faster than CORDIC for the same latency. For all the three input widths, we reach the fastest clock frequency supported by the FPGA device with smaller latency than CORDIC.
Although the use of resources cannot be easily compared with CORDIC's, because our proposal requires a different kind of resources (BSRAM and DSP48 blocks), we also implemented only logic versions of our operator so that comparisons could be made.
The results show that our w = {8, 12} bit only logic implementations need fewer LUTs and registers than CORDIC. For example, for w = 8 bits and two cycles of latency, our proposal is 21% faster while requiring 33% and 44% fewer LUTs and registers, respectively.
We have also implemented in our target device the RMAM degree 1 from [5] , using the FloPoCo framework provided by the authors. Their RMAM operators also produce a normalized output. The results are shown in Table IV . It should be noted that when we tested the operators provided by the FloPoCo tool (using its exhaustive TestBench) for w = {8, 9, 10, 11} bits, they were not 100% LBA-compliant, but only above 99.9%. These results confirm [5] conclusion that CORDIC is generally faster while using fewer resources than RMAM. When Tables II and IV are compared, it is seen that our proposal outperforms RMAM both in resources and speed. This result is as expected, since our error LUT stores values that are smaller and have a lower first derivative than the values stored in the RMAM. A lower first derivative means that the same error caused by truncating the addressing word can be achieved with fewer stored words.
VI. CONCLUSION
In this brief, we propose a novel method and a hardware architecture to approximate the arctangent of a complex number. We report implementation results in an FPGA device for 8, 12, and 16 bit of accuracy. Thanks to its simplicity, the proposed method has demonstrated a reduced use of resources and also good speed when compared with other approximations of akin accuracies. This operator may have practical application for systems that require either a simple approximation with minimum use of resources or improved latency and speed.
