Abstract-The atan2 function computes the polar angle arctan(y/x) of a point given by its cartesian coordinates. It is widely used in digital signal processing to recover the phase of a signal. This article studies for this context the implementation of atan2 with fixed-point inputs and outputs. It compares the prevalent CORDIC shift-and-add algorithm to two multiplierbased techniques. The first one computes the bivariate atan2 function as the composition of two univariate functions: the reciprocal, and the arctangent, each evaluated using bipartite or polynomial approximation methods. The second technique directly uses piecewise bivariate polynomial approximations of degree 1 or 2. Each of these approaches requires a relevant argument reduction, which is also discussed. All the algorithms are last-bit accurate, and implemented with similar care in the open-source FloPoCo framework. Based on synthesis results on FPGAs, their relevance domains are discussed.
I. INTRODUCTION

A. Definitions and notations
In all this article we implement the function atan2(y, x) = arctan(y/x). This function (part of the standard mathematical library) returns an angle in [−π, π]. Compared to a plain composition of division and the arctan function (that returns an angle in [−π/2, π/2]), atan2(y, x) keeps track of the respective signs of x and y. It is used to compute the phase of a complex number x + iy.
We are interested in implementations of this function for fixed-point input and outputs, each on w bits. As arctan( ky kx ) = arctan( y x ), the range of the inputs is not really relevant, as long as both x and y are in the same format. We choose to have both inputs as fixed-point numbers in the range [−1, 1), represented classically in two's complement. There are two sensible choices for the output.
• It may be expressed in radian, from −π to π;
• It may be expressed in [−1, 1), in which case the function being computed is 1 π atan2(y, x). We will refer to this option as "binary angles". There are several reasons for preferring binary angles in a fixed-point context. The first is that it fully uses the representation space. An output on [−π, π] will be coded on a fixedpoint format that may represent the interval [−4, 4) . Therefore, all the codes between π and 4 will never be used.
The second is that it slightly simplifies implementation: the modulo-2π periodicity of radian angles becomes a modulo-2 periodicity that comes for free in two's complement binary arithmetic. For instance, the fact that the two's complement output range [−1, 1) is slightly asymmetrical is not a problem, as modulo-2 arithmetic will map angle 1 × π to angle −1 × π as it should. This will save some computations in the range reduction, and allow it to be exact, where radian angles require computations with the constant π/2 which cannot be represented exactly in binary.
The third reason for preferring binary angles is that it generally simplifies the application context as well. For instance, in QPSK decoding, the atan2 of incoming samples are averaged to estimate a phase shift. Computing this average is simpler with binary angles than with radian angles, again because it is a modulo operation.
For all these reasons, the focus of this article is the function (sometimes called atan2Pi)
However, all the algorithms in this paper may be straightforwardly adapted to radian angles.
B. Overview of hardware implementation methods
This paper compares several hardware implementation techniques, some classical, some new. To ensure a meaningful comparison, each technique is implemented with the same care, and with the same accuracy objective: f (x, y) is computed with last-bit accuracy (LBA), i.e. the error (the difference between the returned result and the infinitely accurate one) is less than the weight of the least significant bit (LSB) of the result. This is sometimes called faithful rounding in the literature. LBA takes full advantage of the output format, but is slightly less accurate than correct rounding. However, it can be obtained at a much smaller cost: correct rounding may require twice the internal precision [1] , which is not justified in a hardware context. Last-bit accurate architectures are also almost bit-for-bit compatible with each other (differing by at most one bit in the last place) which enables a fair comparison of methods.
The most classical technique for fixed-point hardware implementation of atan2 is CORDIC. It performs a series of micro-rotations of the point (x, y) to send it on the x axis. The angle of each micro-rotation is chosen so that it can be implemented by shift and add. CORDIC is studied in Section III, which contributes a fine error and range analysis that guarantees last-bit accuracy at the smallest possible datapath width.
One could consider tabulating all the values of atan2 in a table addressed by a concatenation of x and y. However, this table would have 2 2w entries, which is impractical even for small values of w. Besides, classical table-compression techniques (bipartite, etc) do not apply here, since they address functions of one variable, whereas f (x, y) is a function of two variables.
The next technique, studied in Section IV and illustrated in Figure 5 , therefore reduces atan2 to two functions of one variable: the reciprocal 1/x, used to compute the division z = y/x, and the arctangent, used to compute arctan(z). There is a wide body of techniques that can be used to compute functions of one variable. However, we now have the problem that z may become very large when x comes near to zero. This would require either a floating-point format, or a very large fixedpoint format for z. This can be avoided by a scaling-based range reduction that will be detailed in Section II.
The last and most original technique, studied in Section V, is the use of a piecewise approximation by bivariate polynomials. We have limited experiments to polynomials of first and second degree because the number of multipliers in bivariate polynomials explodes with the degree. Thus, atan2(y, x) will be approximated by P 1 (y, x) = ax + by + c, respectively P 2 = ax + by + c + dx 2 + ey 2 + f xy. The corresponding architectures are depicted by Figures 7 and 8. It is interesting to compare rough estimations of the asymptotic complexity of each approach with respect to the precision. It is well known that CORDIC area and delay are quadratic in the precision. The other methods are table-based and will therefore be exponential in area. However their delay should be sub-quadratic, with the bivariate approach having shorter latency, but larger tables. The main objective of this article is therefore to study the relevance domain of each method in Section VI.
This work essentially focuses on FPGAs. An unexpected result is that, even on modern FPGAs enhanced with DSP blocks and memories, CORDIC is a clear winner. This will be analyzed in SectionVI. However, this work also contributes an open-source VHDL generator that covers all the presented methods, so that comparisons can be performed in other contexts. 
C. Notations for fixed-point formats
In all the article, we will describe a signed fixed-point format by sfix(m, l) and an unsigned fixed-point format by ufix(m, l). The two integers m and l denote respectively the weights of the most significant bit (MSB) and LSB of the format. For instance our inputs and outputs in [−1, 1) on w bits will be on the format sfix(0, −w + 1): the sign bit has weight 2 0 and the LSB has weight 2 −w+1 .
II. RANGE REDUCTIONS
A. Using parity and symmetries
As arctan is an odd function, inputs may be straightforwardly reduced to the positive quadrant. Besides, there is a symmetry between x and y:
If y > x, we may therefore swap x and y, so the computation is reduced to the first octant (x ≥ 0, y ≥ 0, y < x).
Similar range reduction formulae may be used on other quadrants, one example being given by Figure 2 .
Implementing such argument reduction formulae is much easier with binary angles. Indeed, the final addition of π/2 becomes an addition of 1/2 to a number in [−1, 1) (an exact operation with only a one-bit carry propagation). Conversely, adding π/2 would require a full-width carry propagation, while entailing a systematic error due to the rounding of the irrational π/2.
The absolute values |x| and |y| require negating negative input values (the value -1, which has no opposite in two's complement arithmetic, must be saturated). This may be implemented by a full-size subtraction if we want them to be exact, or by bitwise complement (which is smaller and has lower latency) at the cost of an error on the last bit. The two options will be relevant, depending on the subsequent algorithm.
B. Scaling range reduction
Due to the division by x, the value of z = y/x can become very large. On the plot of atan2 over a quadrant (Figure 3 , left), this translates to much larger slopes when y is small.
For an approach that explicits the computation of z, or for a bivariate polynomials approximation, it is therefore interesting to avoid this region. This can be easily performed by the architecture depicted on Figure 4 
where s i = −sgn(y i ) (vectoring mode). This iteration converges as follows:
It may produce a binary angle if the constants arctan 2 −i are expressed in this format. Note that thanks to the initial argument reduction to the [0, π/4] octant, we may start from iteration 1 instead of iteration 0, which leads to a smaller value of K than found in most textbooks .
In FPGA technology, addition carry propagation is very fast compared to generic logic. Therefore we choose to ignore CORDIC variants (reviewed in [1] , [4] and [5] ) that accelerate the iteration by using carry-save or other forms of redundant arithmetic. This choice will be validated by the results in Section VI.
Also, we are interested in an unrolled implementation that, once pipelined, will produce one result each cycle.
A. Error analysis and datapath sizing
If the iteration is computed with infinite accuracy, the error on α i after i iterations is smaller than 2 −i radian [1] . To obtain an approximation error on the binary angle smaller than 2 −w (i.e. smaller than one half-ulp of our result), we may therefore stop after iteration w − 1.
Let us now discuss rounding errors, and more generally fixed-point implementation issues. Let us call p the weight of the LSB on the x i and y i datapath. It is easy to see that
646. This defines the MSB of x i : its format will be ufix(1, −p).
We also have
. This defines the MSB of y i : its format will be sfix(−i +
2, −p).
Let us now study the error of implementing the CORDIC iteration. Formally, if we notex i andỹ i the computed values we may define ε x i =x i − x i , and similarly ε y i =ỹ i − y i . The only source of error in each iteration is the error u x i and u y i due to the loss of the bits discarded in the shifts. This is writteñ
Hence the recurrence defining the accumulation of rounding errors:
If we consider a common bound ε i of ε x i and ε y i , and the bound 2 −p on u x i , we get
with ε 1 = 0, unless the opposite in the range reduction is computed by simply complementing the bits, in which case ε 1 = 2 −p . This only marginally adds to the overall error.
To build an accurate architecture, the recurrence ε i+1 = ε i (1 + 2 −i ) + 2 −p is computed up to i = w − 1. We then define p = w − 1 − log 2 ε w−1 as the precision of the x i and y i datapaths. It ensures that y i is computed with enough accuracy to ensure that rounding errors do not change its sign in a way that cannot be corrected by CORDIC.
Finally, considering that |y i | < 2 −i+1 , the term 2 −i y i added or subtracted to x i is smaller than 2 −2i+1 . Therefore, we may stop updating x i as soon as 2i − 1 > p.
On the α i datapath, the error analysis is much simpler: each arctan 2 −i , correctly rounded to the precision p α , contributes at most one half-ulp to the error. The final rounding will contribute 2 −w . We therefore need g α = 1 + log 2 ((w − 1) × 0.5) guard bits to absorb all these errors.
Here the only trick is that the final rounding may be performed by truncation, provided 2 −w has been added to one of the arctan 2 −i constants.
IV. RECIPROCAL-MULT-ARCTAN
This variant, illustrated by Figure 5 (where x r and y r are the reduced inputs), is the most natural but requires the scaling argument reduction of Figure 4 . After this scaling, we have 0.5 ≤ x ≤ 1. Since we are on the first octant where y ≤ x, we also have z = y/x ∈ [0, 1]. On such intervals, both 1/x and arctan(z) are regular (see Figure 6 ) and can be fully tabulated, or well approximated by polynomials. The main difficulty will be to define the bitwidths of r and z that will enable last-bit accuracy at the minimal cost.
Here again, the situation is much simpler with binary angles. The output of the arctangent box belongs to [0, 1/4] and is needed in the ufix(−2, −w + 1) format. Its two leading bits will be added, error-free, by the reconstruction.
A. Related work
This technique is classically used in software [6] , [7] . The division z = y/x is performed in the working precision, then a high-degree odd polynomial is used for an accurate approximation of arctan(z).
In hardware, early works [8] , [9] use piecewise linear approximations of unspecified accuracy for the computation of arctan(z) to some precision. On FPGAs, this technique is used in [2] with multipartite tables [10] to save multipliers. Comparisons to CORDIC are only given for 12-bit precision (the main claim is a near halving of power consumption). Similarly, the accuracy of the architecture is only measured for 12-bit precision (10.3 bits in the worst case). In a followup article [11] the division is transformed to a subtraction in the logarithmic domain. However the architecture now requires three function evaluations (two log conversions on the input, and one arctan(2 z )). A non-uniform decomposition is used for the evaluation of arctan(2 z ). The present section generalizes [2] to more precisions, and with more rigorous rules of the game (last-bit accuracy) and error analysis. It also uses polynomials of larger degrees, as in [12] , to approximate the unary functions when needed.
B. Error analysis
Let us define formally the errors of the reciprocal and arctangent boxes:
We have to truncate or round the product to keep the number of input bits to the arctangent box small. Therefore, the multiplier will not be exact: let us define its error
We may now define the total error on z:
As 0 ≤ y ≤ 1 we will have
For last-bit accuracy, the total error must remain smaller than u = 2 −w+1 . The total error is defined as
From this we deduce that, for z ∈ [0, 1],
where the rounding up of 1 π to 1 3 accounts for the higher order error terms. The previous inequality may also be observed on Figure 6 . Applying it to z = y/x + ε z , we get:
and finally
C. Datapath dimensioning
Let us first consider the simpler case when the two functions are purely tabulated. The arctangent table may be correctly rounded to its output format. We therefore have |ε atan | ≤ u/2, and we need |ε total | < u = 2 −w+1 (last-bit accuracy). Unfortunately, we cannot use a correctly-rounded reciprocal table, because we want its output to be in [1, 2) (1, −w) . This entails |ε recip | ≤ u.
The multiplier itself does not need to be correctly rounded. A truncated multiplier, faithful to an output format ufix(0, −w) for z will entail |ε mult | ≤ u/2. This combination of errors satisfies (1). Interstingly, however, this error bound is not reached: exhaustive tests (up to w = 12) show that LBA is still achieved with a truncated multiplier faithful to ufix(0, −w+1).
A correctly rounded multiplier adds to the cost of the multiplier (by requiring to compute and sum all its partial products), but it provides the same accuracy for z using one less bit. Saving one bit on z will reduce the size of the arctangent box, possibly halving it. This trade-off between the multiplier and the arctangent box has not been fully studied yet.
Using a black-box polynomial approximator [12] or the multipartite method [10] entails that the two functions are themselves last-bit accurate. In this case, we may no longer ensure |ε atan | ≤ u/2 at a reasonable cost, due to the Table Maker's dilemma [1] . What is possible is a more accurate implementation that will ensure |ε atan | ≤ 3u/4. We now have a u/4 error budget to distribute among the multiplier and the reciprocal table. With the same reasoning, using a ufix(1, −w−1) format for r and a ufix(0, −w) for z will satisfy (1).
This analysis has been implemented in the FloPoCo generator (FixAtan2 operator). This program builds an architectures for any size w and degree d. For d = 1, the bipartite method is currently used, and the multipartite method should be used in the near future.
D. FPGA-specific considerations
For small precisions, the following optimizations may apply on some FPGAs that embed hard memory and multiplier blocks.
The hard multipliers will compute the full product, so truncation will be faithful, and correct rounding will come at the expense of only one addition which can be mapped in the post-adder included in all recent DSP blocks. The latter is relevant if it saves one hard memory block, for instance in the situations depicted below.
Memory blocks are of fixed size (20Kb and 36Kb on current Altera and Xilinx chips respectively). They are dual-ported, which means that we can store two tables in one single block. The largest architecture that fits in a single 20Kb block is for w = 11, with a reciprocal table of 2 9 × 11 bits and an arctangent table of 2 10 × 9 bits, both packed in a dual-port block configured as 2 10 × 20. In a 36Kb block, the largest Fig. 7 . Architecture based on a first order bivariate polynomial architecture that fits is for w = 12. Then we may have a 2 10 × 12 reciprocal table and a 2 10 × 10 arctangent table. In both cases the multiplier must be correctly rounded.
V. PIECEWISE BIVARIATE APPROXIMATION
In this section, we evaluate atan2(y, x) using a piecewise bivariate polynomial approximation of degree 1 or 2. We are not aware of comparable work in the literature.
The idea is to decompose the domain of Figure 4 into a grid of square subdomains indexed by the first few bits of x and y. On each subdomain, we evaluate a bivariate polynomial in the remaing bits. In other words, let x h (resp. y h ) the number formed of the k (resp. k + 1) MSBs of x r (resp. y r ). Let δ x and δ y the numbers formed of their w−k −1 respective LSBs, such as
The bits (x h , y h ) are used to index a table of the coefficients of bivariate polynomials approximating atan2(y, x) on each subdomain (see Figures 7 and 8 ):
As illustrated by Figure 3 , such approximation benefits from the scaling range reduction of Figure 4 .
A. First order bivariate approximation
Here
A first way of obtaining a, b and c is the Taylor series around point (x 0 , y 0 ):
0 ) the centre of a square subdomain indexed by (x h , y h ), we may obtain the coefficients a, b and c.
The multiplications a · δ x and b · δ y are of reduced size, as
The coefficients a, b and c can also be obtained using a different approach. A degree-1 bivariate polynomial is a plane. Three points from the surface of f define a unique plane that goes through these three points. This plane is a bivariate degree-1 approximation of the function. Therefore, if we chose three points of the form (x, y, f (x, y)) with (x, y) inside the square subdomain, the equation of the plane defined by these 
B. Second order bivariate approximation
The approach is very similar to the previous one. The motivation for using the second order is to provide a better approximation, which will enable a smaller value of k, hence a smaller coefficient table. We approximate atan2(y, x) as:
The coefficients a through f are again obtained by using the Taylor series for atan2(y r , x r ), around a point (x 0 , y 0 ):
Replacing f by atan2(y r , x r ), and performing the variable change x r = x h + δ x and y r = y h + δ y, we get:
. Error analysis
We limit the discussion to the approximation of atan2(x r , y r ) by the second order Taylor polynomial, as the case of the first order method is simpler. Our goal is to produce a result which is within 1 unit in the last place (ulp) from the exact mathematical result, meaning:
where ε total is the total error and:
The magnitude of ε method , the method error, is determined by the parameter k. Limiting the approximation to the second order terms results in:
where the c x i 's and c y i are obtained by expanding the Taylor series. ε method is bounded by:
c x max , c y max and c xy max being the maximum values of c x , c y , and c xy . Equation 2 determines a bound for ε method , which, in turn, determines the value of k, that must satisfy k > w 3 . But this is a pessimistic bound. In practice, we set the value of k more optimistically, then fill the table, and compute ε method in each point of the domain using the actual coefficient values. If the error is above budget, k is increased. Using the actual values for the coefficients provides a tighter bound on ε method .
The error due to the final rounding ε f inal round can be limited to 1/2ulps by rounding the final result to the output accuracy.
Each term of T 2 (x r , y r ), except for c, brings to the final sum an error due to the truncation/rounding of the multiplications/squares, and an error due to storing the real coefficients in the table. The expressions for the error terms are given after the variable change. For a · δ x, the error is:
The same holds for b · δ y term. The error due to c in the final sum comes from storing the coefficient in the table:
For the second order terms, the error can be found in a similar manner:
The same holds for e · δ y 2 . Lastly, for f · δ x · δ y the error can be expressed as:
This gives the expression for ε round :
Let us assume that the errors due to storing the coefficients in the tables are all equal to ε table and that the errors due truncation/rounding of multiplications and squares are ε mult and ε sqr respectively. Thus, ε round becomes:
D. Datapath dimensioning
As in the case of the previous sections, the error analysis ensures that the final result is within less than 1 ul p of the mathematical result. It also allows us to create architectures that compute with the minimal amount of resources.
Thus, let us analyze the required number of additional guard bits g due to ε f inal round and ε round . Due to ε f inal round we must extend the precision of our computations to −w (from −w+1). The contribution due to ε round is worth the discussion. From equation 3, ε round is influenced by the precision of the multiplications and squares. It is also influenced by precision of the stored coefficients, which depends on our choice for the k parameter. This is due to the dependence of ε round to δ x and δ y, which satisfy δ x < 2 −k and δ y < 2 −k .
When deciding the value of g, we attempt try to strike a balance between the size of the multiplications and that of the coefficient tables.
VI. RESULTS AND DISCUSSION
This section presents synthesis results obtained for Virtex 6 (6vhx380tff1923) using ISE 14.7. Tables I and II show results when the multipliers as well as the tables are synthesized purely in logic (in FPGA LookUp Tables, or LUT) . In this case, we use faithful truncated multipliers to save resources and latency. Degree 0 uses plain tabulation, degree 1 uses a bipartite approximation [10] , and degree 2 and above use a Horner scheme inspired by [12] . The squarers of Figure 8 also use bipartite approximation (which saves resources over a dedicated squarer for the small precisions needed here).
A. Logic-only synthesis
The objective of these tables is to compare the methods in absolute terms. A first observation is that CORDIC behaves extremely well on modern FPGAs. The multiplier-and table-based methods never perform better in area, and only rarely beat its latency. This is especially disappointing as the architecture of Figure 8 exhibits a lot of parallelism. The comparison between CORDIC and the RecipMultAtan method is consistent with the findings of [2] .
The implementation of combined sine and cosine (which can be viewed as the inverse function of atan2) was studied in [13] . Interestingly, it lead to the opposite conclusion: for sine/cosine, it was shown that CORDIC had longer latency than multiplier-based methods even when the multipliers were implemented in logic.
The bivariate approximation methods could be compared to a bivariate interpolation technique used in [14] . The technique used there resorts to two sequential interpolation processes (on x, then on y). The results reported in that work (extrapolated from a different FPGA family, Altera Stratix II), tend to show that this bivariate interpolation technique has longer latency and consumes even more resources, mainly due to the high memory requirements.
It is interesting to re-assess the complexity asumption made in the introduction in the light of these results. In principle, classical (i.e. non-redundant) CORDIC is quadratic both in area and delay: The CORDIC algorithm in precision w consists of about w iterations, each consisting of three add/sub operations of about w bits, hence the quadratic area. Besides, there is a carry propagation in each iteration, whose LSB input depends on the MSB of the previous iteration: there is a critical path of size w 2 .
On the one hand, Table I indeed exhibit this quadratic area. However, the constant is very small. Specifically, we can observe that the area of CORDIC is roughly 3w 2 . This means that each CORDIC iteration is indeed implemented in 3 LUT per bit. In other words, a multiplexed addition and subtraction is mapped to a row of LUTs, and therefore consumes no more than a simple addition.
On the other hand, the latency of CORDIC does not seem quadratic, it seems linear in w. This is explained by the fact that the carry propagation delay is about 30 times faster than the standard routing used between two iterations. It justifies a posteriori the choice of ignoring redundant versions of CORDIC. A second observation is that the degree 1 bivariate approximation is never interesting. Its size explodes too fast, and this even impacts the latency.
The scaling-based argument reduction is quite small and fast, as illustrated by Table II . In [2] , the shifts were implemented as one-hot encoding then multiplication in a DSP block. This is probably overkill.
B. Pipelining, DSP-and table-based results
Small multipliers (or DSP blocks) and memories are embedded in all recent FPGAs. Table III shows some synthesis results for pipelined operators at various frequencies, exploiting these resources when possible. The multiplier-based methods do indeed improve logic count, but bring no clear advantage in terms of latency nor frequency compared to a pipelined unrolled CORDIC. This probably reflects the current weakness of FloPoCo at managing the pipelining of such entities.
Still, we again observe the excellent match of CORDIC to FPGA logic: as the third line of TableIII shows, it is possible to pack two 16-bit iterations in one pipeline stage operating at nearly 400MHz.
Finally, these results should not be extrapolated to ASIC. There, without carry propagation and large LUTs, CORDIC will appear much less favorably.
VII. CONCLUSION AND FUTURE WORK
This article compares several methods for the evaluation of the atan2 function. The most novel method, based on piecewise bivariate polynomials, does reduce the latency as much as expected, at least on FPGAs. There are many improvements to bring to this method, the first being to reduce k using degree-2 approximation technique that are more accurate than Taylor. Unfortunately we are not aware of a Remez or minimax polynomial approximation algorithm for functions of two inputs. The problem is that the alternation property on which Remez is based [1] is difficult to transpose in two dimensions. Current work also focuses on improving the pipelining of the multiplier-based methods to make the best use of the FPGA embedded resources.
To make things even better for CORDIC, it should be noted that it may also compute the module x 2 + y 2 along with the angle [1] . This costs only one additional constant multiplication by 1/K.
Among the possibilities to explore around atan2, [15] decomposes the computation into two successive rotation, a coarse one and a finer one. Both first approximate z = y/x, then compute atan2(z). This provides a blend between CORDIC and multiplicative methods.
A table-and multiplier-based combined range reduction and scaling method could enable the bivariate techniques to scale better, but again at the expense of latency.
