Abstract-This paper introduces a parabolic synthesis methodology for implementation of approximations of unary functions like trigonometric functions and logarithms, which are specialized for efficient hardware mapped VLSI design. The advantages with the methodology are, short critical path, fast computation and high throughput enabled by a high degree of architectural parallelism. The feasibility of the methodology is shown by developing an approximation of the sine function for implementation in hardware.
INTRODUCTION
Unary functions, such as trigonometric functions and logarithms, are extensively used in computer graphics, digital signal processing, communication systems, robotics, astrophysics, fluid physics, etc. For these high-speed applications, software solutions are in many cases not sufficient and a hardware implementation is therefore needed. Implementing a numerical function f(x), by a single look-up table is simple and fast which is strait forward for lowprecision computations of f(x), i.e., when x only has a few bits. However, when performing high-precision computations a single look-up table implementation is impractical due to the huge table size and the long execution time. Approximations only using polynomials have the advantage of being ROMless, but they can impose large computational complexities and delays [1] . By introducing table-based methods to the polynomials methods the computational complexity can be reduced and the delays can also be decreased to some extent [1] . The CORDIC (COordinate Rotation DIgital Computer) algorithm [2] [3] has been used for these applications since it is faster than a software approach. CORDIC is an iterative method and therefore slow which makes the method insufficient for this kind of applications.
II. METHODOLOGY
The methodology is developed for implementing approximations of unary functions in hardware. The approximation part is of course the important part of this work but there are sometimes two other steps that are necessary, a preprocessing normalization and postprocessing transformation as described by P.T.P. Tang [1] . The computation is therefore divided into three steps, normalizing, approximation and transforming.
A. Normalizing
The purpose with the normalization is to facilitate the hardware implementation by limiting the numerical range.
The normalization has to satisfy that the values are in the interval 0 ≤ x < 1 on the x-axis and 0 ≤ y < 1 on the y-axis. The coordinates of the starting point shall be (0,0). Furthermore, the ending point shall have coordinates smaller than (1,1) and the function must be strictly concave or convex through the interval. An example of such a function, called an original function f org (x) , is shown in Fig. 1 . 
B. Developing the Hardware Architecture
When developing hardware architecture that approximates an original function, only low complexity operations are used. Operations such as shifts, additions and multiplications are efficient to implement in hardware and therefore searched for. The downscaling of the semiconductor technologies and the development of efficient multiplier architectures has made the multiplication operation efficient in both size and execution time when implemented in hardware. The multiplier is therefore commonly used in this methodology when developing the hardware.
As in Fourier analysis [4] the proposed methodology is based on decomposition of basic functions. The proposed methodology is not, as in Fourier analysis, a decomposition method in terms of sinusoidal functions but in second order parabolic functions. Second order parabolic functions are used since they can be implemented using low complexity operations. The proposed methodology also differs from the Fourier synthesis process since the proposed methodology is 978-1-4244-3828-0/09/$25.00 ©2009 IEEEusing multiplications in the recombination process and not additions as in the Fourier case.
The proposed methodology is founded on terms of second ordered parabolic functions called sub-functions s n (x) , that when recombined, as shown in (1), obtains to the original function f org (x) . When developing the approximate function, the accuracy depends on the number of sub-functions used. 
The procedure when developing sub-functions is to divide the original function f org (x) , with the first sub-function s 1 (x) . This division generates the first function f 1 (x), as shown in (2).
The first sub-function s 1 (x), will be chosen to be feasible for hardware, according to the methodology described in (4). In the same manner the following functions f n (x), are generated, as shown in (3). 
C. Methodology for developing sub-functions
The methodology for developing sub-functions is founded on decomposition of the original function f org (x) , in terms of second order parabolic functions for the interval 0 ≤ x < 1.0 and the sub intervals within the interval. The second order parabolic function is chosen as decomposition function since the structure is reasonable simple to implement in hardware i.e. only low complexity operations such as additions and multiplications are used.
First sub-function
The first sub-function s 1 (x), is according to (4). 
Second sub-function
The first function f 1 (x), is calculated according to (2) and the result of this operation is a function which appearance is similar to a parabolic function.
The second sub-function s 2 (x), is chosen according to the methodology as a second order parabolic function, see (6). 
In (6) the coefficient c 2 , is chosen to satisfy that the quotient between the function f 1 (x) , and the second subfunction s 2 (x), is equal to 1 when x is equal to 0.5, see (7).
Thereby the second function f 2 (x), will get a shape of a lying S. When developing the third sub-function s 3 (x) , the function is to be split into two parabolic functions where the first function is restricted by function f 2 (x) to be in the interval 0 ≤ x < 0.5 and the second function is thus restricted to the interval 0.5 ≤ x < 1.0. By splitting the function we get strictly convex and concave functions in each interval. The intervals can be chosen differently but that will lead to a more complex hardware, as shown in section 3.
Sub-functions when n > 2
For functions f n (x) when n > 2, the function is characterized by the form of one or more S shaped functions. When developing the higher order sub-functions, each S shaped function is divided into two parabolic functions. For each sub interval, a parabolic sub-function is developed as an approximation of the function f n (x) in the sub interval. To show which sub interval the partial functions is valid for, the subscript index is increased with the index m, which gives the following appearance of the partial function f n,m (x). In equation (8) it is shown how the function f n (x), is divided into partial functions f n,m (x), when n > 2.
As shown in (8), the number of partial functions is doubled for each order of n > 1 i.e. the number of partial functions is 2 n-1
. From these partial functions, the corresponding subfunctions are developed. Analogous to the function f n (x), also the sub-function s n+1 (x), will have partial sub-functions s n+1,m (x). In equation (9) it is shown how the sub-function s n (x), is divided into partial functions when n > 2. 
Note that in (9), the partial functions to the sub-functions; x has been changed to x n . The change to x n is normalization to the corresponding interval, which simplifies the hardware implementation of the parabolic function. To simplify the normalization of the interval of x n it is selected as an exponentiation by 2 of x where the integer part is removed. The normalization of x is therefore done by multiplying x with 2 n-2 , which in hardware is n-2 left shifts and the integer part is dropped, which gives x n as a fractional part (frac( )) of x, as shown in (10). 
As in the second sub-function s 2 (x), the second order parabolic function is used as an approximation of the interval of the function f n-1 (x), as shown in (11). 
Where the coefficients c n,m is computed according to (12). After the approximation part the result is transformed into its desired form.
III. HARDWARE IMPLEMENTATION
For the hardware implementation two's complement representation [5] is used. The implementation is divided into three hardware parts, preprocessing, processing, and postprocessing as introduced by P.T.P. Tang [1] .
A. Preprocessing
In this part the incoming operand v is normalized to prepare the input to the processing part, according to section 2A.
If the approximation is implemented as a block in a system the preprocessing part can be taken into consideration in the previous blocks, which implies that the preprocessing part can be excluded.
B. Processing
In the processing part the approximation of the original function is directly computed in either iterative or parallel hardware architecture. The benefit of the iterative architecture is the small chip area whereas the disadvantage is longer computation time.
The advantages with parallel hardware architecture are that it gives a short critical path and fast computation to the price of a larger chip area. To increase the throughput even more, pipeline stages can be implemented in the parallel hardware architecture.
In the sub-functions (4), (6) and (11) x 2 and x n 2 are reoccurring operations. Since the square operation x n 2 , in the parallel hardware architecture is a partial result of x 2 a unique squarer has been developed. In Fig. 2 In Fig. 2 the squaring algorithm that performs the partial products x n 2 , shown. The first partial product p, is the squaring of the least significant bit in x. The second partial product q, is the squaring of the two least significant bits in x.
The partial product r, is the result of the squaring of the three least significant bits in x and s is the result of the squaring of x.
The squaring operation is performed with unsigned numbers. When analyzing the squarer in Fig. 2 , it was found that the resemblance to a bit-serial squarer [6] [7] is large. By introducing registers in the design of the bit-serial squarer the partial results of x n 2 is easily extracted. The squaring algorithm can thus be simplified to one addition only when computing each partial product.
From (4), (6) and (11) it is found that only the coefficients values differentiate when implementing different unary functions. This implies that different unary functions can be realized in the same hardware in the processing part, just by using sets of coefficients.
Since the methodology is calculating an approximation of the original function the error between the functions can be both positive and negative. Especially if the value of the approximation is less than the original function some extra bits of word length compared with the desired precision is needed to accomplish the desired precision of the approximation. If the order of the last used sub-function is n > 1, an improvement of the precision can be done by optimizing one or more coefficients c 2 in (7) or c n,m in (12). The optimization of coefficients will minimize the error in the last used subfunction and thereby it can reduce the word length needed to accomplish the desired accuracy. Computer simulations perform such coefficient optimization numerically.
C. Postprocessing
The postprocessing part transforms the value to the output result z. If the approximation is implemented as a block in a system the postprocessing part can be taken into consideration in the following blocks, which implies that the postprocessing part can be excluded.
IV. IMPLEMENTATION OF THE SINE FUNCTION
An implementation of sin(v), using the proposed methodology is described in this section as an example.
A. Preprocessing
To satisfy that the values of the incoming operand x is in the interval 0 ≤ x < 1 a π/2 is multiplied with the operand as shown in (13).
To normalize the sin(v) function v is substituted with x which gives the original function f org (x) (14).
B. Processing
For the processing part, sub-functions are developed according to the proposed methodology. For the first subfunction s 1 (x), the coefficient c 1 is defined according to (5) . The determined value of the coefficient are, c 1 = 0.570796.
The first function f 1 (x) , is computed as shown in (15).
When developing the second sub-function s 2 (x), the coefficient c 2 is defined according to (7) . The determined value of the coefficient are, c 2 = 0.400857.
The second function f 2 (x), is computed as shown in (16).
To develop the third sub-functions s 3 (x) , the second function f 2 (x), is divided into its two partial functions as shown in (8). The third order of sub-functions is thereby divided into two sub-functions, where s 3, 0 (x 3 ) is restricted to the interval 0 ≤ x < 0.5 and s 3, 1 (x 3 ) is restricted to the interval 0.5 ≤ x < 1.0 according to (9). A normalization of x to x 3 is done to simplify in the implementation in hardware, which is described in (10).
For each sub-function, the corresponding coefficients c 3,0 and c 3,1 is determined. These coefficients are determined according to (12) so that higher order of sub-functions can be developed. The determined values of the coefficients are, c 3,0 = -0.0122449 and c 3,1 = 0.0105948.
The third function f 3 (x) , is computed as shown in (17). To develop the fourth sub-functions s 4 (x) , the third function f 3 (x), is divided into its four partial functions as shown in (8). The fourth order of sub-functions is thereby divided into four sub-functions, where s 4, 0 (x 4 ) is restricted to the interval 0 ≤ x < 0.25, s 4, 1 (x 4 ) is restricted to the interval 0.25 ≤ x < 0.5, s 4, 2 (x 4 ) is restricted to the interval 0.5 ≤ x < 0.75 and s 4, 3 (x 4 ) is restricted to the interval 0.75 ≤ x < 1.0 according to (9). A normalization of x to x 4 is done to simplify the hardware implementation, which is described in (10).
For each sub-function, the corresponding coefficients No postprocessing is needed since the result out from the processing part has the correct size.
C. Optimization
Only sub-function s 4, 3 (x) in the interval 0.75 ≤ x < 1.0 has to be improved by optimization. By optimizing the coefficient the necessary word length to achieve the desired precision could be reduced from 17 bits to 16 bits. The optimized value of the coefficient is, c 4,3 = 0.0128228.
V. COMPARISON
The most common methods used when implementing approximation of a unary functions in hardware is look-up tables, polynomials, table-based methods with polynomials and CORDIC. Computation by table look-up is attractive since memory is much denser than random logic in VLSI realizations, but since the size of the look-up table grows exponentially with increasing word lengths, both the table size and execution time becomes totally intolerable. Computation by polynomials is attractive since it is ROM-less. The disadvantages are that it can impose large computational complexities and delays. Computation by table-based methods combined with polynomials is attractive since it reduces the computational complexity and decreases the delays. Since the size of the look-up tables grows with the accuracy the execution time also increases with the needed accuracy. Computation by using CORDIC is attractive since it is using an angular rotation algorithm that can be implemented with small look-up tables and hardware, which is limited to simple shifts and additions. The CORDIC algorithm is an iterative method with high latency and long delays. This makes the method insufficient for applications where short execution time is essential.
In all methods including the proposed method, it is a tradeoff between complexity and memory storage. By using parallelism in the computation and parabolic synthesis in the recombination process, the proposed methodology thereby gets a short critical path, which assures fast computation.
VI. CONCLUSIONS
A novel methodology for implementing approximations of unary functions such as trigonometric functions, logarithmic functions, etc. in hardware is introduced in this paper. The architecture of the processing part automatically gives a high degree of parallelism. The methodology to develop the approximation algorithm is founded on parabolic synthesis. This combined with that the methodology is founded on operations that are simple to implement in hardware such as addition, shifts, multiplication, contributes to that the implementation in hardware is simple to perform. By using the parallelism and parabolic synthesis one of the most important characteristics with the out coming hardware is the parallelism that gives a short critical path and fast computation. The structure of the methodology will also assure an area efficient hardware implementation. The methodology is also suitable for automatic synthesis.
