Three approaches for computing sines and cosines on FPGAs are studied in this paper, with a focus of highthroughput pipelined architecture, and state-of-the-art implementation techniques. The first approach is the classical CORDIC iteration, for which we suggest a reduced iteration technique and fine optimizations in datapath width and latency. The second is an ad-hoc architecture specifically designed around trigonometric identities. The third uses a generic table-and DSP-based polynomial approximator. These three architectures are implemented and compared in the FloPoCo framework.
INTRODUCTION
Sine and cosine functions are quite pervasive in signal processing applications. We are interested here in comparing the efficiency on FPGA of several approaches that compute the sine and/or the cosine of a number in fixed point.
Some application only require the sine or the cosine of a value at some point, but a fused sine-and-cosine implementation is interesting for several reasons. First, many applications require both, for instance to implement rotations. Second, many methods compute both anyway. Indeed, most such methods are derived from the following identity on complex exponential: e j(a+b) = e ja × e jb (1) or its real version, which is used in practice:
sin(a + z) = sin(a) cos(z) + cos(a) sin(z) cos(a + z) = cos(a) cos(z) − sin(a) sin(z)
Identity (1) is best understood as follows: a rotation of angle a + b is obtained as the composition of the rotation of angle a and the rotation of angle b. It can be applied several times. The sine and cosine of a given angle α are eventually obtained by rotating the complex point 1 by the angle α.
Overview of existing algorithms
The well known CORDIC algorithm, due to Volder [11] , is a classical technique to compute both sine and cosine using such a decomposition into micro-rotations. These rotations are chosen in such a way that (2) can be computed only using additions and shifts. CORDIC has been present in This work was presented in part at the international symposium on HighlyEfficient Accelerators and Reconfigurable Technologies (HEART2013), Edinburgh, Scotland, June 13-14, 2013.
vendor core generators almost from the start, well studied on FPGAs [1] , and still, FPGA implementations are submitted every year to FPGA conferences, with very little or no novelty compared to the original article by Volder. Many CORDIC variations have been designed over time, see for instance Muller's textbook [8] or an excellent review written for the 50 years of CORDIC [7] . However, mostly due to the fact that standard additions are accelerated by fast-carry logic on FPGAs, the preferred implementation remains the simplest one [10] . Roughly speaking, to compute w-bit sine and cosine, it requires 3(w + g) additions of size w + g where g = log2w + 1 is a number of guard bits ensuring last-bit accuracy.
CORDIC only uses the configurable logic of FPGAs, but modern FPGAs have been enhanced with embedded memories and multipliers (or DSP blocks). These resources can also be used to evaluate sine and cosine, using Equation (2) with a coarser decomposition of an input angle into a sum of two smaller angles. A good decomposition of α is α = α h +α l where α h is the number formed from the k leading bits of α, and α l is formed of the remaining lower bits, so that α l < 2 −k . This enables the use of an acceptable table for sin(α h ) and cos(α h ), and a small degree polynomial, typically Taylor, for sin(α l ) and cos(α l ). This idea will be exploited in Section 4. However, other decompositions have been suggested [6, 8] .
A variant of CORDIC, called "reduced iterations CORDIC" [9] , uses a similar idea. It first rotates the w-bit input angle using roughly w/2 CORDIC microrotation. The remaining angle z is then smaller than 2 −w/2 . Its sine and cosine can be approximated (with good enough accuracy) as sin z ≈ z and cos z ≈ 1. In both cases the approximation error is smaller than 2 −w . This ensures that a final rotation of angle z can be computed, using (2), with just two small multiplications (inputs of size roughly w/2) and two additions.
The purpose of this article is to survey all these techniques for the specific context of modern FPGAs with embedded memories, embedded DSP blocks, and large LUTs.
Specification
In this paper we compute a fixed-point approximation of sin(πx) and cos(πx), where x is a signed (two's complement) number on w bits in [−1, 1). This specification is quite natural, as it maps the modulo-2π trigonometric periodicity to the modulo-2 reduction behind the two's complement representation. In other words, using such an implementation, we can compute sines and cosines of numbers in a wider fixed-point range, for instance having k integer bits, by simply dropping these bits: this corresponds to the periodicity of the sine.
However, the range [−1, 1) is not symmetric in two's complement fixed-point: −1 is representable, but not 1. On the input side, this asymmetry is not a problem, it actually matches well the periodicity of the input. On the output side, however, there is, classically, a small problem: we want a fixed-point result on n bits, so if we want to avoid overflow, the value 1 must be actually represented as 0.11..112 = 1 − 2 −w . By symmetry, the output value −1 should be represented as −1+2 −w , not as the (representable)
Therefore, what a w-bit trigonometric operator should compute is the function (1 − 2 −w ) sin(πx) or (1 − 2 −w ) cos(πx). This is not a problem in practice. First, applications of this operator can always reduce the corresponding systematic error by increasing the precision w. Second, the sequel will show that for all proposed architectures, the scaling factor (1 − 2 −w ) can be computed at no cost. The implementations described in this article all provide the same numerical quality: all the operators studied in the following are faithfully rounded (another wording is last-bit accurate): they return a value whose error with respect to the true mathematical value is provably smaller than the weight of the least-significant bit of the result. This ensures that all the bits of the result are useful. Indeed, any error larger than that would mean that we output useless bits, which would waste routing resources. All the synthesis results correspond to architectures that have been extensively tested against this accuracy specification.
We focus on fixed-point for two reasons. Firstly, with its output in [−1, 1] and the periodicity property on its input, a trigonometric function should, in most applications, be considered a fixed-point object. In other words, we claim that most applications involving a floating-point sine or cosine will benefit, when implemented on FPGAs, from a fixedpoint conversion of the datapath around these functions. Secondly, we have studied in detail elsewhere [5] the overhead of accurate floating-point implementations of sine and cosine. Most of the present work is also relevant to such floating-point implementations.
Finally, we focus on fully pipelined architectures able to produce one result per cycle. Two of the three techniques studied here could lead to sequential implementations that take several cycles to compute one result, but these are out of scope of the present article.
Outline and contributions
Section 2 will overview the common (and classical) argument reductions that benefit to the three techniques presented in this article. Section 3 focuses on CORDIC for FPGAs. Several minor contributions (a proper error analysis, a flexible pipeline, a reduction of some datapath widths, an implementation of the reduced iteration technique) sum up to a state-of-the art implementation of unrolled CORDIC. Section 4 presents an original rotation architecture targeting modern FPGAs. It is based on tables (matching embedded RAMs) and multiplications (matching embedded DSP blocks). The choice of its various parameters is discussed. Section 5 presents another table-and multiplier-based architecture, this time computing only one of sine or cosine. It uses a generic polynomial evaluation technique. Section 6 compares synthesis results for a range of precisions and a range of frequency targets.
A final contribution is the availability of open-source, high quality implementations of these operators in the FloPoCo open-source core generator 1 , starting with version 2.5.0.
ARGUMENT REDUCTION
The first argument reduction (very classically) considers the three leading bits of x, which we call s (for sign), q (for quadrant) and o (for octant). We call the remaining bits y ∈ 0, 1 4 . We now may use classical trigonometric identities such as sin( π 2 − x) = cos(x) to reduce the problem to computing sin(πy ) and cos(πy ) for y ∈ 0, 
VARIATIONS ON CORDIC
The classical CORDIC iteration for computing sine and cosine is the following:
Here ci and si converge to the cosine and sine of the input y [8] .
We implemented in FloPoCo a generator of such CORDIC rotator, in unrolled (fully pipelined) fashion. In this case, the values of arctan(2 −i ) are constant inputs to the additions. Equation (4) works for radian arguments, but adapting it to our specification is simply a matter of scaling each constant arctan(2 −i ), so it entails no cost. Similarly, the 1 − 2 −w scaling is merged at no cost in the initial scaling factor of Equation (3) . In this equation, n is the number of iterations, slightly larger than w -more on this below. (4) is of the order of the value of zi when we stop the iteration. Its bound can be computed numerically, and the iteration count n is defined as the smallest integer for which this method error is smaller than 2 −w−2 . Then, we have to bound the rounding errors. All the computations are performed on w + g bits, where g is a number of guard bits (to be determined now) that will absorb the accumulation of these errors. Let us call u = 2 −w−g the value of the unit in the last place (ulp) on this datapath.
Error analysis
On the zi datapath, each iteration adds a constant that is the correct rounding of the actual arctan(2 −i ) to the precision u. This adds an absolute error bounded by u/2 = 2 −w−g−1 to the error of the previous iteration, so the accumulated error after i iterations is bounded by i.u/2.
On the ci and si datapath, we may express an error bound i (expressed in u) for iteration i by the following recurrence:
where the +1 comes from the truncation of the shifted addend (i.e. 2 −i dici). Note that this term could be reduced to 1/2 by rounding the shifted addend instead of truncating it. However this would require adders larger by one bit, eventually dividing the overall error by less than two, so saving at most one iteration. If the CORDIC operator is viewed as a square of full adders, we would win one bit on one size of the square and lose it on the other side: there is no net gain.
One observes that the rounding error bound on ci and si is always larger than that on zi. Our implementation therefore computes n numerically and uses it to determine the smallest g ensuring n.2 −w−g < 2 −w−2 . The sum of the method error and the rounding error is thus smaller than 2 −w−1 . The rounding of cn and sn to the target precision 2 −w entails another error bounded by 2 −w−1 . Thus, the overall error is strictly smaller than 2 −w , ensuring the faithful rounding property.
Reducing the zi datapath
Another minor contribution of our implementation is to compute just right the iteration on z. Indeed, all this computation is implemented in fixed point, and it can be observed that each iteration roughly removes one most significant bit to the angle zi. In other words, zi needs only be computed on w +g −i bits. This reduces the total area by about 1/6th.
In principle it could also slightly reduces the critical path: in the early iterations, the critical data dependency of one iteration to the next one is zi → di → zi+1. If we know di earlier because zi is shorter, we can start an iteration on ci+1 and si+1 before the end of the previous one: as soon as the bits of ci and si that are needed for ci+1 and si+1, we can launch the computation of ci+1 and si+1, and the carry propagations of the two additions will execute in parallel. In the later iterations, as ci and si are being shifted further, the gain is smaller, which is unfortunate because that is when di can be computed the fastest. We are currently trying to express these subtleties in FloPoCo's pipelining framework to reduce the latency of the CORDIC implementation.
We observe that synthesis tools are able to pack the ci and si lines of Equation (4) as one LUT per bit, while still using the fast-carry line. The LUT consumption reported in Table 2 is thus very predictable, close to 2.5(w + g) 2 .
Reduced iterations CORDIC
In this version, the CORDIC iteration is stopped as soon as the remaining rotation can be computed, with sufficient accuracy, by
The size of z is slightly more than w/2. We may truncate xi and yi before the products to the same accuracy, with only an additional contribution of 2 −w−g to the overall error. Thus, only two multipliers of roughly w/2 × w/2 bits are needed. In practice, the reduced iteration approach will consume only two of 18 × 18-bit signed multipliers of DSPenabled FPGAs for values of w up to 32.
The computation of the initial scaling factor must be modified accordingly, as well as the error analysis. These and other technical details can be found in the FloPoCo code.
A TABLE-AND DSP-BASED PARALLEL POLYNOMIAL ARCHITECTURE

Algorithm
Here we further split our octant angle y into its a most significant bits t, and its lower bits y red ∈ [0, 2 −(2+a) ). We use t to address a table storing sin(πt) and cos(πt). Meanwhile, the sine and cosine of πy red are evaluated by first computing z = πy red , then by using the Taylor series of sin z and cos z.
We chose Taylor series over generic polynomial approximation as in [4] , for two reasons. The first is that the sine series is odd and the cosine series is even, so their evaluation requires half as many multiplications for a given degree. The second is that up to terms of degree 4, the coefficients are powers of two, or powers of two multiplied by 1/3. Powers of two are for free, and division by 3 is very cheap on FPGAs [2] .
In this implementation we choose a such that 4(a+2)−2 ≥ w + g which implies that
It enables us to neglect terms beyond
, yielding the following formulae:
We may now use the addition formulae to recover the values of sin(πyin) and cos(πyin):
sin(πyin) = sin(πt) cos z + cos(πt) sin z
cos(πyin) = cos(πt) cos z − sin(πt) sin z
Again, this is the rotation equation (2) . Actually, cos z is always very close to 1. Therefore, rewriting sin(πt) cos z as sin(πt)+sin(πt)(cos z−1) replaces a large multiplier with a significantly smaller one and an addition. Therfore, instead of (9) and (10), we prefer to use the fol-lowing formulae:
4.2 Implementation details 1 4 − y is computed as ¬y, inducing an error of 2 −(w+g) . This avoids one overflow situation and saves a carry propagation.
The corresponding architecture is depicted on Figure 2 . Before each multiplication, we truncate the two inputs to the precision of the result: this is illustrated by the small boxes labelled T . We also use truncated multipliers, and z 2 /2 is computed using a squarer. Table 1 shows the internal precisions used on this figure for various value of the input/output precision w -this data is computed by our implementation out of w.
As we need very few bits of z 3 /6, this value can be simply tabulated for small sizes of z 3 /6 (currently up to 10 bits). Otherwise, we designed an ad-hoc architecture that computes z − z 3 /6 (the dashed box of Figure 2 ) inside a single bit heap [3] . Only a few bits need to be divided by 3, using a variation of [2] . An example of bit heap thus obtained is shown on Figure 3 . 
Error analysis
We again have a fixed-point architecture which loses up to 15 ulps to rounding/truncation errors: one half-ulp for each table, one ulp for each truncation T and for each truncated multiplier or squarer, and one ulp for the initial computation of 1/4 − y by xoring. Therefore, the value of g that enables last-bit accuracy is 4 -it no longer depends on w. The data in Table 1 includes these guard bits.
ARCHITECTURE USING A GENERIC POLYNOMIAL EVALUATOR
The last option we have explored is a wrapper of the generic polynomial evaluator presented in [4] in the range reduction of Section 2.
In addition to the argument x, this operator inputs a bit sincos that determines if the sine or the cosine should be computed. Here x is only reduced to one quadrant: y ∈ [0, 1/2). The polynomial evaluator computes sin(πy ) for y ∈ [0, 1/2). The proper output is reconstructed out of sin(πy ) and the bits s, q and sincos. 
RESULTS AND DISCUSSION
In Table 2 , SinAndCos denotes the combined sine/cosine operator presented in Section 4, SinOrCos is the singleoutput operator of Section 5 for which d is the degree of the polynomial approximation. All the results are for Virtex-5(5vfx100tff1738-3), using ISE 12.1. Very comparable results can be obtained for other Xilinx or Altera targets. We can obtain various frequency/latency trade-offs by varying the -frequency option to FloPoCo.
The CORDIC results match those reported for the corresponding Xilinx logicore operator [12] in parallel mode. The reduced iteration approach is very attractive, halving the iteration count at the cost of two DSP blocks only for up to 32 bits.
A first question one may ask is: on an FPGA without DSP blocks, do the multiplicative approaches makes sense? A partial answer is obtained by comparing in Table 3 a combinatorial CORDIC, and a LUT-based combinatorial version of the operator of Figure 2 , where the tables and the multipliers are implemented using LUTs only. These two implementations are functionally equivalent (same inputs, same outputs, same accuracy). For the soft multipliers, we use the FloPoCo implementation of truncated soft multipliers in order to benefit from truncation. We observe that the two approaches consume comparable resources. However, the multiplier-based approach drastically reduces the latency. We conclude that there is little point in using unrolled CORDIC in this context. However, CORDIC is still relevant in its iterative implementation (there is no iterative version of the operator of Figure 2) .
The DSP-based approaches are increasingly relevant as precision increases. Their frequency doesn't match yet that of the CORDIC operators. This is not due to an intrinsic limitation of the algorithms used, but to the more complex pipeline construction. The current FloPoCo framework is not very good at exploiting the internal registers of DSP blocks. This is being worked upon. All the DSP-based operators should be able to reach 400 MHz on Virtex-5, albeit at the cost of a longer latency and more registers.
When comparing SinAndCos and SinOrCos, one should remember that it takes two instances of the second to emulate the first. SinAndCos is therefore more efficient than SinOrCos, although not by a very large margin.
CONCLUSION AND PERSPECTIVES
This article is a survey of sine and cosine evaluation on modern FPGAs with DSPs and embedded memories. It evaluates two relevant variations on the classical CORDIC, and two DSP-based techniques. This work is probably the state of the art, although we still expect performance improvements. At any rate it exposes a lot of trade-offs between performance (frequency and latency) and resource consumption (logic, DSP, memories).
Short-term future work includes more tuning. Floatingpoint versions of the trigonometric functions should also be provided, using similar techniques [5] .
The technique presented here will scale well to double precision. However, the ROM size will then becomes a problem. An obvious solution is to decompose the input further and compose more rotations, at the cost of more multiplications. An alternative approach is to address the table with fewer bits, but allow larger degree polynomials for sin z and cos z, which also costs more multiplications. Which solution wins is not obvious.
As such questions illustrate, there are more fundamental questions behind the search for efficient architectures on today's FPGAs: How many bits does one need to flip and move in order to compute a sine faithful to w bits, and at what hardware cost? Asking this question properly requires to model the compared costs of FPGA logic, embedded memories, DSP blocks, etc. Then, such a question is essentially answered by exhibiting and comparing architectures, as we have done in this article. 
