Abstract-Linear (order-1) function evaluation schemes, such as bipartite and multipartite tables, are usually effective for lowprecision approximations. For high-output precision, the lookup table size is often too large for practical use. This brief investigates the so-called (M, p, k) scheme that reduces the range of an input argument to a very small interval so that trigonometric functions can be approximated with very small lookup tables and a few additions/subtractions. An optimized hardware architecture is presented and implemented in both a field-programmable gate array device and standard-cell-based technology. Experimental results show that the proposed scheme achieves more than a 50% reduction in total chip area compared with the best linear approach for a 24-bit evaluation. 
I. INTRODUCTION
T RIGONOMETRIC functions are extensively used in computer graphics, digital signal processing, communication systems, and robotics, to name a few fields of application. Hardware function generators are usually desirable because of their major advantages in speed over software solutions and their potential to save power by avoiding the use of hundreds of general-purpose instructions.
In 1984 Sunderland et al. [3] considered approximating a 12-bit sine function in hardware with an input argument x less than π/2 with the use of tables. They proposed to evenly split x (in binary form) into three 4-bit subwords, i.e., x = x 0 + x 1 + x 2 , where x 0 < π/2, x 1 < 2 −4 π/2, and x 2 < 2 −8 π/2, and use the following equation:
sin (x 0 + x 1 + x 2 ) ≈ sin(x 0 + x 1 ) + cos(x 0 ) sin(x 2 ). (1) By doing so, instead of using one large table with 12 address bits, two small tables, each with 8 address bits, are needed: one for sin(x 0 + x 1 ) and one for cos(x 0 ) sin(x 2 ). In 1995 Sarma and Matula [4] An enhanced scheme was then proposed by Dinechin and Tisserand [6] , who divided the input argument into more than three subwords, so that multiple small tables can be built. The scheme is thereby called the multipartite table-lookup method (MTM). Both SBTM and MTM are linear (order-1) approximations that possess an inherent limitation: to evaluate functions with high precision, the input argument needs to be reduced to very small values [2] . Very recently, Brisebarre et al. [1] have proposed a new scheme called (M, p, k)-friendly points to approximate trigonometric functions by using two small bipartite tables and only a few additions. However, no analysis of the error bound or practical hardware implementations were conducted to evaluate the effectiveness of this scheme. Low and Jong [2] revisited the tables-and-additions strategy and successfully proposed an integrated add-table lookup-add (iATA) approach, which could save 20%-60% of memory compared with MTM. Other approaches, such as lookup table cascades [7] and order-2 piecewise polynomial approximation and interpolation approaches [8] , are also studied in the literature.
In this brief, we investigate and develop implementations of the (M, p, k) scheme. Section II introduces the proposed algorithm, rectifies prior incorrect definitions, and performs an analysis of the error bound. Section III describes an optimized hardware architecture that efficiently implements the proposed algorithm. Section IV provides the field-programmable gate array (FPGA) and application-specified integrated circuit (ASIC)-based implementation results. Section IV concludes this brief.
II. (M, p, k) SCHEME

A. (M, p, k)-Friendly Points and Angles
Given an n-bit input angle x in the range of [0, π/2), we aim to evaluate the trigonometric functions sin(x) and cos(x) with p-bit output precision. Assuming thatx approximates x (we will see later howx is obtained), we define θ = x −x. When θ is much smaller than π/2, sin(θ) and cos(θ) can be efficiently approximated with high accuracy with the use of the 1549-7747 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. previously mentioned table-and-addition-based schemes. We can then obtain sin(x) and cos(x) in the form of
Instead of directly implementing (2) in hardware with the use of four multipliers, the (M, p, k) scheme uses a different way to perform these computations. The scheme seeks special pairs of numbers a, b, and z, which satisfy cos(x) = a · z and sin(x) = b · z. If a and b are bounded by an integer M and z = 1/ √ a 2 + b 2 has no more than k nonzero bits when recoded into the canonical form, which contains no adjacent nonzero digits [10] , one can implement (2) 2) The number
can be written in its canonical form
where function rnd(x, n) rounds variable x to its nth fractional bit, m and e are integers, z i ∈ {−1, 0, 1}, and the number of nonzero z i 's is less than or equal to k for i = 1, 2, . . . , p + m + 2. Note that the definition of z is more accurate than that of [1] .
B. Tabulating the Desired Points and Angles
To implement (2), one needs to tabulate the appropriate a, b, z, andx in a lookup table. The search for the desired points is a trial-and-error process: First, the input x = x 0 · x 1 x 2 · · · x n−1 is split into two subwords, i.e., the higher (r + 1)-bit part (as the table address)
where r is an integer. The angle x T divides the input range [0, π/2) into smaller regions. Second, we select one (M, p, k)-friendly point (for given M and k) whose angle is closest to x 0 .x 1 x 2 x 3 · · · x r−1 x r 1 in each region and denote it byx T . The largest distance betweenx T and
, then all the x within that region will be at a distance
fromx T . Consequently, the value of θ = x −x T has an absolute value less than 2 −r . If there are no suchx T found, M and k will be enlarged, and another round of searching is performed. As long as r is sufficiently large, we can build small bipartite tables to generate sin(θ) and cos(θ) with high-output precision. Table I provides a numerical example of the selected (M, p, k)-friendly points and angles for n = p = 24, m = 8, k = 7, and r = 7. The search process, which is performed offline, has a time complexity of O(M 2 ) for given r.
C. Computation Steps of the Algorithm
Assuming that the correct (M, p, k)-friendly points (a, b) and anglesx T are stored in a lookup table T0, the following computation steps are used to implement (2). 1) Subtraction of θ = x −x T is performed, and the values of sin(θ) and cos(θ) are looked up in two bipartite tables. 2) The following is computed:
Assuming that a and b are stored in radix-4 with a digit set {−2, −1, 0, 1, 2}, each multiplication can be implemented in m/2 additions/subtractions; 3) S · z and C · z are performed to implement (2) . Because z is in the canonical representation and the number of nonzero bits is bounded by k, the two multiplications can also be simplified to 2(k + 1) additions/subtractions.
We note that the efficiency of the proposed scheme relies on how small θ can be for not-too-large values of parameters M and k. Generally, for a given n and p, multiple pairs of parameters can be selected. For instance, we list several possible values of the parameters for n = 24 in Table II . The optimal choice of the parameters is determined by the hardware cost and performance. A scheme is proposed in Section III-B to address this design issue. 
D. Error Bound
By denotingz, sin(θ), and cos(θ) as the exact values of 1/ √ a 2 + b 2 , sin(θ), and cos(θ), respectively, three truncation errors can be expressed in e z = (z −z), e s = [sin(θ) − sin(θ)], and e c = [cos(θ) − cos(θ)]. Then, the total approximation error entailed by the proposed algorithm is bounded by
where cos(x) is the exact value of the final result obtained by infinite precision computations. From (3) we already know that |e z | < 2 −(p+m+3) . If both outputs of the sine and cosine bipartite tables have a maximum absolute error of 2 −(p+2) , the absolute value of the total error is bounded by 2 −p . Therefore, the precision of the results is only determined by the parameters n and p. Fig. 1 shows the top-level hardware architecture that implements the proposed algorithm. Table T0 stores the precomputed (M, p, k)-friendly points (a, b) , anglesx T , and associated z. SBT0 and SBT1 are two small bipartite lookup tables that generate sin(θ) and cos(θ). In Table T0 , the m-bit integers a and b are recoded in radix-4 with digit set {−2, −1, 0, 1, 2}. To reduce the area of shifters used, we introduce another constraint on z, which enforces that the nonzero z i are distributed in regions with fewer bits. Fig. 2 shows an example that z is divided into the lower (p + m + 2)/2 -bit and higher (p + m + 2)/2 -bit subwords, and the new constraint requires that the number of nonzero z i 's in the two subwords is not larger than k/2 and k/2 , respectively. Usually, a new round of search forx T should be performed. The value of m might need to be enlarged to find a sufficient number ofx T 's.
III. HARDWARE ARCHITECTURE
A. Proposed Architecture
After recoding, z has (k + 1) fields. The first field (Field-0) has log 2 (e) bits and is reserved for the leading "1." The other fields either have log 2 ((p + m + 2)/2) + 1 bits or log 2 ((p + m + 2)/2) + 1 bits, in which one sign bit is added at the most significant bit. Those fields that have no corresponding nonzero z i are filled with all "1," which represents multiplication by zero.
The multiple generation module (MGEN) generates m/2 "multiples" by implementing a simple logic as described in [11] . The multioperand adder (MOPADD) has a compressor tree structure based on [3:2] carry-save adders when standardcell-based implementations are targeted. For FPGA implementation, an [N:3] redundant adder structure is proposed in this brief to efficiently utilize the carry-propagate adders (CPAs) and ternary adder structures on FPGAs with dedicated carry chains that can provide carry propagation by more than one order of magnitude faster compared with the use of general logic resources [14] . Fig. 3 shows an example of a [9:3] compressor for the proposed redundant adder structure. The compressor consists of two diagonally deployed linear arrays (only one is shown in Fig. 3 ) of CPAs. The structure can be efficiently mapped into two arrays of CPAs, in which the fast carry chain (blue lines) is efficiently utilized. The proposed redundant adder significantly reduces the critical path delay of the proposed architecture. The right-shift barrel shifter has an internal structure that merges two consecutive levels of 2-to-1 shift operations into a single stage [15] . Efficiently mapping each stage into 4-to-1 multiplexors can reduce critical path delay without increasing the shifter area. Sign extension is performed after shifting.
B. Selecting the Optimal Parameters
From Section III-A, we know that the total size of table T0 and two bipartite tables can be estimated by
where S T 0 and S SBT denote the memory cost in bits. Applying the precomputed parameter values in Table II , we draw the diagram of Fig. 4 to illustrate the variation trend of the lookup table size with parameter r. An optimal value for r to minimize the memory usage can be observed. Similarly, one can estimate the area of the multioperand adders and shifters by
where S ADD is calculated in numbers of FA used, whereas S SFT is counted in numbers of 2-to-1 multiplexors. NR denotes the ratio of the area of one 2-to-1 multiplexor to that of one FA. In this brief, we assume that NR = 1.6 for ASIC and NR = 1 for FPGA implementation. After the optimal r is selected, another diagram can be drawn to find the optimal m. Usually, multiple local optimal values can be considered. For the case of n = 24, they are m = 7, 9, and 11. This diagram is not presented because of limited space.
The critical path delay of the second pipeline stage is substantially affected by the value of m, whereas that of the third stage is determined by k. Table II shows that the value of k gradually increases as m decreases. Therefore, in this brief, we select the pair of m and k that minimize the value of m + k (i.e., m = 9 and k = 7 for the presented instance). [7] , and order-2 interpolation [8] . The memory resource is counted in bit, whereas other supporting arithmetic/logic units are measured in the numbers of FA. The proposed design is observed to achieve 79% and 73% reductions in memory resource compared with [6] and [7] for 24-bit evaluations, respectively. The increase in logic resources are 8.2 × and 2.5 ×. Because of the degree-2 interpolation algorithm adopted, the scheme of [8] can further compress the lookup tables used. However, due to the large multipliers used, the growth in logic resources is significant.
Two design instances with precision n = 16 and n = 24 were finally implemented. Tables IV and V report the implementation results. To enable direct comparison architectures that do not utilize lookup tables, all ROMs were synthesized into pure combinational logic cells (ASIC) or distributed ROM structures (FPGA) as suggested by [8] . The MTM designs were generated in very high speed integrated circuit hardware description language (HDL) by using the program provided by [12] and implemented in the same technologies. A balanced two-stage pipeline structure is implemented such that each stage has a similar delay to the proposed design.
Both tables show that the total area of the proposed design is almost two times larger than that of MTM for 16-bit evaluation. It is revealed that, when using the proposed algorithm to perform low-precision evaluations, reducing the lookup table size does not satisfactorily compensate for the hardware cost introduced by the extra arithmetic operations. For 24-bit evaluation, the advantage of the proposed architecture is obvious: it achieves more than 63% reduction in the total circuit area over MTM for both pipelined and nonpipelined FPGAbased implementations. It is also observed that the CPAs in the proposed architecture introduce a 42% longer critical latency. However, a simple pipelining scheme enables both scheme working at a similar frequency at the cost of a few registers. For ASIC designs, the area reduction is 51%, and the latency gap also decreases to 17%. Our approach consumes a 46% lesser chip area than the newly reported iATA [2] scheme, but their critical path delays are close.
COordinate Rotation DIgital Computer (CORDIC)-based designs are compared with the proposed scheme in Table IV . Standard parallel CORDIC structures are implemented with the Xilinx IPcore generator [13] . For both 16-and 24-bit evaluations, the proposed design achieves around 70% reduction in the critical path delay. The main contribution comes from the optimized multioperand redundant adder structure presented in this brief. Compared with the radix-4 CORDIC design [9] , our proposed architecture is both superior in critical path delay and total circuit area. Because of the limited information, however, we cannot directly compare with the schemes presented in [7] and [8] .
V. CONCLUSION
This brief has presented a new scheme to compute sine and cosine functions. The main contribution of our work includes rectifying prior incorrect definitions, conducting an analysis on the error bound and developing an optimal hardware architecture that efficiently implements the proposed algorithm. The 16-and 24-bit design instances are coded in Verilog HDL and mapped on Xilinx FPGA and 65-nm standard-cell-based technology. Comparison of our proposed work with multipartite and CORDIC-based designs shows that a considerable reduction in chip area and critical path delay is achieved for 24-bit evaluations.
