Digital sine and cosine waves are used in countless applications in vector-rotation-based Digital Signal Processing (DSP). The COordinate Rotation DIgital Computer (CORDIC) algorithm has become very popular because of the simplicity with which it produces nearly ideal digital sine and cosine waveforms for modulation and demodulation in DSP modules. In this paper, we present the design of a pipelined architecture for the flexible and scalable computation of digital sine and cosine values using the CORDIC algorithm. The application-specific CORDIC processor, operating in circular rotation mode, achieves high system throughput because pipelining reduces the latency of each individual stage. Saving area on the FPGA is essential to the design of a pipelined CORDIC and is achieved by optimizing the number of micro-rotations. The quantization error is also minimized by choosing the required number of iterations. The design has been synthesized and implemented on a Xilinx Spartan-3 device using the ISE 10.1 design tool suite, and the results are shown and discussed.
Introduction
The COordinate Rotation DIgital Computer (CORDIC) algorithm (Volder, 1959) is a well-known, hardware-efficient iterative algorithm that uses only simple shift-and-add operations to calculate hyperbolic, exponential, logarithmic and trigonometric functions such as sine, cosine, magnitude and phase with great precision (Hu, 1992; Kang and Swartzlander, Jr., 2006). The algorithm is also well suited to vector rotation operations such as the Fast Fourier Transform (FFT) (Sung, 2006; Jiang, 2007), the Discrete Cosine Transform (DCT) (Jeong et al., 2004), Eigen Value Decomposition (EVD), etc. The same functions could be implemented using multipliers, variable shift registers or Multiply Accumulator (MAC) units, but saving silicon area on a chip is a primary criterion in VLSI technology. That is why CORDIC-based hardware is preferred to MAC-based or multiplier-based systems.
There are several ways to generate digital sine and cosine signals using trigonometric functions. Sine and cosine terms can be calculated using polynomial approximation or table look-up with interpolation, but these methods have a major implementation drawback: they require a large number of gates and a large ROM. The CORDIC algorithm, on the other hand, computes the desired trigonometric values in a simple and efficient way while minimizing the required gate count (Kang and Swartzlander, Jr., 2006; Chen, 2008). Moreover, the proposed architecture improves on the performance of previous approaches with a simpler design. The proposed CORDIC architecture is free of the internal ROM and the sign-bit register (SBR) (Aggarwal et al., 2012), which are usually used to store the control bits for the number of shifts corresponding to the micro-rotations and the directions of the micro-rotations, respectively. To reduce latency, earlier designs have adopted either a higher-radix concept (Antelo et al., 2000) or a double-rotation CORDIC algorithm, at the cost of significantly increased area. This motivates the pipelined architecture implemented in this paper, which aims to optimize both area and latency. Although our design cannot be claimed to be the best available, its simplicity, high speed, low latency and high throughput, with one output per primary clock cycle, make it suitable for real-time DSP applications.
The paper is structured as follows. Section 2 reviews the theoretical background of the CORDIC algorithm. Section 3 gives a thorough description of the design of the pipelined architecture. Section 4 discusses the various errors and optimization techniques. Finally, Section 5 presents the hardware synthesis results.
CORDIC Algorithm: A Theoretical Review
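Jack E. Volder's CORDIC algorithm is derived from the general equations for vector rotation. The principle of CORDIC computation is to decompose the desired rotation angle θ into a weighted sum of a set of predefined elementary rotation angles, each of which can be accomplished with simple shift-add operations. For M iterations applied to an input vector $(x, y)^T$, with the initial conditions $x_0 = x$, $y_0 = y$ and $z_0 = \theta$, this decomposition can be represented as

$$\theta = \sum_{i=0}^{M-1} \delta_i \alpha_i$$

i.e., the total accumulated rotation angle is equal to θ. The $\delta_i$, $i = 0, 1, \ldots, M-1$, denote a sequence of ±1s that determine the direction of each elementary rotation. When M is the total number of elementary rotation angles, the i-th angle $\alpha_i$ is given by:

$$\alpha_i = \begin{cases} 2^{-i}, & m = 0 \\ \tan^{-1}\left(2^{-i}\right), & m = 1 \\ \tanh^{-1}\left(2^{-i}\right), & m = -1 \end{cases}$$

where m = 0, 1 and -1 correspond to the rotation operation in a linear, a circular and a hyperbolic coordinate system, respectively. For a given value of θ, the CORDIC iteration is given by:

$$x_{i+1} = x_i - m\,\delta_i 2^{-i} y_i, \qquad y_{i+1} = y_i + \delta_i 2^{-i} x_i, \qquad z_{i+1} = z_i - \delta_i \alpha_i$$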
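To bring a unit vector to the desired angle θ, the CORDIC algorithm applies known recursive rotations to the vector. The known rotation values are listed in Table 1 as precomputed angles. Once the vector is at the desired angle, the X and Y coordinates of the vector are equal to cos θ and sin θ, respectively. Let a unit vector in iteration i be rotated by some angle $\theta_i$; the recursively updated equations then take the following form:

$$x_{i+1} = x_i \cos\theta_i - y_i \sin\theta_i, \qquad y_{i+1} = x_i \sin\theta_i + y_i \cos\theta_i$$

The above equation can be simplified and written as:

$$x_{i+1} = \cos\theta_i \left(x_i - y_i \tan\theta_i\right), \qquad y_{i+1} = \cos\theta_i \left(y_i + x_i \tan\theta_i\right)$$

If the rotation angles are restricted so that $\tan\theta_i = \pm 2^{-i}$, the multiplication by the tangent is converted into an arithmetic right shift. Since cosine is an even function, $\cos(\theta_i) = \cos(-\theta_i)$. The iterative equation can therefore be reduced to:

$$x_{i+1} = K_i \left(x_i - \delta_i 2^{-i} y_i\right), \qquad y_{i+1} = K_i \left(y_i + \delta_i 2^{-i} x_i\right)$$

where $\delta_i = \pm 1$ is the direction of rotation and $K_i = \cos\left(\tan^{-1} 2^{-i}\right) = 1/\sqrt{1 + 2^{-2i}}$ is known as the gain factor for each iteration. If M iterations are performed, then the scale factor, K, is defined as the product of every $K_i$:

$$K = \prod_{i=0}^{M-1} K_i = \prod_{i=0}^{M-1} \frac{1}{\sqrt{1 + 2^{-2i}}}$$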
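As a numerical illustration of the precomputed angles and the scale factor discussed above, the following Python sketch tabulates the circular-mode angles atan(2^-i) and the product of the per-iteration gain factors. The function name `cordic_tables` and the choice of 16 iterations are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math


def cordic_tables(num_iterations):
    """Return (angles, gain) for circular-mode CORDIC.

    angles[i] is atan(2**-i) in radians; gain is the product of the
    per-iteration factors 1 / sqrt(1 + 2**(-2*i)).
    """
    angles = [math.atan(2.0 ** -i) for i in range(num_iterations)]
    gain = 1.0
    for i in range(num_iterations):
        gain *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    return angles, gain


# Hypothetical example: 16 micro-rotations.
angles, K = cordic_tables(16)
for i, a in enumerate(angles[:4]):
    print(f"i={i}  angle={math.degrees(a):.4f} deg")
print(f"K = {K:.6f}")
```

The scale factor converges quickly: beyond the first few iterations each additional factor is very close to 1, so K is nearly independent of M for the iteration counts considered here.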
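The elementary functions sine and cosine can be computed using the rotation mode of the CORDIC algorithm if the initial vector starts at $(x_0, y_0) = (K, 0)$ with $z_0 = \theta$, so that after M iterations $x_M \approx \cos\theta$ and $y_M \approx \sin\theta$.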
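The rotation-mode computation can be sketched in floating-point Python as follows. This is a behavioral model only, under the assumption that the vector is pre-scaled by starting at (K, 0) and that the input angle lies within [-π/2, π/2]; the name `cordic_sin_cos` and the default of 16 iterations are hypothetical.

```python
import math


def cordic_sin_cos(theta, num_iterations=16):
    """Approximate (cos(theta), sin(theta)) by circular rotation-mode CORDIC.

    Assumes |theta| <= pi/2 (inside the CORDIC convergence range).
    """
    angles = [math.atan(2.0 ** -i) for i in range(num_iterations)]
    gain = 1.0
    for i in range(num_iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    # Start at (K, 0) so that the unscaled iterations end on the unit circle.
    x, y, z = gain, 0.0, theta
    for i in range(num_iterations):
        d = 1.0 if z >= 0.0 else -1.0      # direction of this micro-rotation
        shift = 2.0 ** -i                  # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]                 # drive the residual angle to zero
    return x, y


c, s = cordic_sin_cos(math.pi / 6)
print(c, math.cos(math.pi / 6))
print(s, math.sin(math.pi / 6))
```

In hardware the multiplication by `shift` is a wired shift and the multiplication by `d` is the choice between an adder and a subtractor, so no multiplier is needed.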
Pipelined Architecture of CORDIC
Many CORDIC architectures are available in the literature. The pipelined architecture has an edge over the others in terms of delay and throughput, and its convergence is also good. In this CORDIC architecture, a number of rotation modules are incorporated, and each module is responsible for one elementary rotation. The modules are cascaded through intermediate latches (Fig. 1). Within every stage of the pipelined CORDIC architecture, only adders/subtractors are used. The shift operations are hardwired to perform the fixed right shift by i bits ($2^{-i}$) required in the i-th stage, which saves the large silicon area that barrel shifters would require. The precomputed value of the i-th iteration angle $\alpha_i$, as given in Table 1, required at each module can be stored at a memory location. The delay can be adjusted by using a proper bit-length in the shift register. Since there is no need for sign detection for the residual angle to converge to zero, carry-save adders are well suited to this architecture. The use of these adders reduces the stage delay significantly. With pipelining, the propagation delay of each stage is the delay of a single adder, so the throughput of the architecture is increased manyfold.
Increasing the number of iterations also increases the latency of the design. If an iterative implementation of the CORDIC were used, the processor would take several clock cycles to produce an output for a given input. The pipelined architecture instead converts the iterations into pipeline stages. Each pipeline stage takes exactly one clock cycle to pass its result on, so once the pipeline has been filled an output is obtained at every clock cycle. The simulated output for digital sine/cosine is shown in Fig. 6.
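The latency and throughput behavior of the pipeline described above can be illustrated with a small cycle-level software model. This is only a sketch under stated assumptions: one latch per stage, floating-point arithmetic in place of fixed-point adders, and hypothetical names (`PipelinedCordic`, `clock`, `run_pipeline`) that do not come from the paper's HDL design.

```python
import math


class PipelinedCordic:
    """Software model of an unrolled CORDIC with one latch per stage."""

    def __init__(self, num_stages=16):
        self.num_stages = num_stages
        self.angles = [math.atan(2.0 ** -i) for i in range(num_stages)]
        self.gain = 1.0
        for i in range(num_stages):
            self.gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
        # regs[i] models the latch at the output of stage i (None = empty).
        self.regs = [None] * num_stages

    def _stage(self, i, state):
        """One elementary rotation: only add/subtract and a fixed shift."""
        if state is None:
            return None
        x, y, z = state
        d = 1.0 if z >= 0.0 else -1.0
        shift = 2.0 ** -i
        return (x - d * y * shift, y + d * x * shift, z - d * self.angles[i])

    def clock(self, theta=None):
        """Advance one clock edge; return (cos, sin) in the last latch or None."""
        # Update back to front so each stage reads its predecessor's old value.
        for i in range(self.num_stages - 1, 0, -1):
            self.regs[i] = self._stage(i, self.regs[i - 1])
        first_in = None if theta is None else (self.gain, 0.0, theta)
        self.regs[0] = self._stage(0, first_in)
        last = self.regs[-1]
        return None if last is None else (last[0], last[1])


def run_pipeline(thetas, num_stages=16):
    """Feed one angle per clock, then flush the pipeline with empty inputs."""
    pipe = PipelinedCordic(num_stages)
    outputs = [pipe.clock(t) for t in thetas]
    outputs += [pipe.clock() for _ in range(num_stages - 1)]
    return outputs


for cycle, out in enumerate(run_pipeline([0.1, 0.5, -0.7], num_stages=16), 1):
    print(cycle, out)
```

In this model the first result appears on the clock edge whose number equals the number of stages, and after that one result leaves the pipeline on every clock, which is the latency/throughput trade-off discussed in the text.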
Overflow Control
The most recurrent problem in a CORDIC implementation is overflow. Since the first tangent value is $2^{0} = 1$, the rotation range will be $[-\pi/2, +\pi/2]$. The difference in binary representation between these two angles is one bit. Overflow arises when a rotation angle crosses from the positive right angle to the negative one. To avoid this, an overflow control is added. It checks the signs of the operands involved in the addition or subtraction and the sign of the result. If overflow is produced, the result keeps its last sign, so the final result is not affected. In the overflow control, the sign of $z_i$ determines whether addition or subtraction is to be performed.
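The sign-based overflow check described above can be sketched on two's-complement integers as follows. This sketch assumes that "keeping the last sign" is modeled by clamping the result to the largest value with the operands' sign; that interpretation, the 16-bit default width and the names `wrap` and `add_with_overflow_control` are assumptions for illustration, not details given in the paper.

```python
def wrap(value, width):
    """Interpret value modulo 2**width as a signed two's-complement integer."""
    value &= (1 << width) - 1
    if value & (1 << (width - 1)):
        return value - (1 << width)
    return value


def add_with_overflow_control(a, b, width=16):
    """Add two signed integers as a width-bit adder would.

    Overflow is detected from the signs alone: both operands have the same
    sign and the wrapped result has the opposite sign. On overflow the
    result is clamped so that it keeps the operands' sign (an assumption).
    Returns (result, overflow_flag).
    """
    raw = wrap(a + b, width)
    overflow = (a < 0) == (b < 0) and (raw < 0) != (a < 0)
    if not overflow:
        return raw, False
    limit = -(1 << (width - 1)) if a < 0 else (1 << (width - 1)) - 1
    return limit, True


print(add_with_overflow_control(100, -50))       # no overflow
print(add_with_overflow_control(30000, 10000))   # positive overflow, clamped
print(add_with_overflow_control(-30000, -10000)) # negative overflow, clamped
```

A subtraction can be checked the same way by negating the second operand before the call, which mirrors how the sign of the residual angle selects between addition and subtraction in each stage.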
Errors and Optimizations
Theoretically, a CORDIC realization requires an infinite number of iterations to give exact results. In practice, a CORDIC realization is restricted to a finite number of iterations, so an approximation error remains. The angle approximation error and the finite-word-length error are both parts of this approximation error.
Angle Error
The convergence property of a CORDIC design is vital in various DSP applications. Because the shift sequence provides only a finite set of elementary angles, it is not possible to represent an arbitrary rotation angle θ without error. So the angle approximation error can be defined as:

$$\varepsilon = \theta - \sum_{i=0}^{M-1} \delta_i \alpha_i$$

where ε is the residual angle still to be rotated after completion of the CORDIC iterations. For any given rotation within the convergence range, the angle approximation error is bounded by:

$$|\varepsilon| \le \alpha_{M-1}$$

Fig. 2 shows the convergence of CORDIC for various numbers of micro-rotations. With few micro-rotations the error is very high; in other words, the convergence is poor. The error is minimal at approximately 17 micro-rotations, which gives the optimum region of convergence.
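The dependence of the residual angle on the number of micro-rotations can be explored numerically with the sketch below. It evaluates only the angle recursion in floating point and compares the worst residual over a sweep of input angles with the last elementary angle atan(2^-(M-1)); the function names and the sweep of 181 sample angles are assumptions for illustration, and the output is not a reproduction of Fig. 2.

```python
import math


def residual_angle(theta, num_iterations):
    """Residual angle left after num_iterations circular micro-rotations."""
    z = theta
    for i in range(num_iterations):
        alpha = math.atan(2.0 ** -i)
        z -= alpha if z >= 0.0 else -alpha
    return z


def worst_residual(num_iterations, samples=181):
    """Largest |residual| over angles swept uniformly across [-pi/2, pi/2]."""
    worst = 0.0
    for k in range(samples):
        theta = -math.pi / 2 + math.pi * k / (samples - 1)
        worst = max(worst, abs(residual_angle(theta, num_iterations)))
    return worst


for m in (4, 8, 12, 17):
    bound = math.atan(2.0 ** -(m - 1))
    print(f"M={m:2d}  worst residual={worst_residual(m):.3e}  bound={bound:.3e}")
```

Each additional micro-rotation roughly halves the bound, so the angle error quickly falls below the resolution that a fixed word length can represent.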
Truncation Errors
The truncation error is due to the finite word length effect. If the internal word length of the CORDIC has a finite number of bits in the fractional part, the quantization error, including the scaling error, can be shown by plotting it against the number of bits (b) and the number of iterations (M), as shown in Fig. 3.
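The interaction between the number of fractional bits b and the number of iterations M can be examined with a fixed-point model such as the sketch below. It assumes integer data paths with b fractional bits, truncating arithmetic right shifts, rounded angle constants and a pre-scaled start value; these choices and the name `cordic_fixed` are assumptions for illustration and may differ from the word-length conventions used in the paper.

```python
import math


def cordic_fixed(theta, num_iterations, frac_bits):
    """Rotation-mode CORDIC on integers with frac_bits fractional bits."""
    scale = 1 << frac_bits
    angles = [round(math.atan(2.0 ** -i) * scale) for i in range(num_iterations)]
    gain = 1.0
    for i in range(num_iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = round(gain * scale), 0, round(theta * scale)
    for i in range(num_iterations):
        # Python's >> on negative integers is an arithmetic (floor) shift,
        # which matches a truncating hardware right shift.
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - angles[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + angles[i]
    return x / scale, y / scale


theta = math.pi / 6
for b in (8, 12, 16, 24):
    c, s = cordic_fixed(theta, 16, b)
    err = max(abs(c - math.cos(theta)), abs(s - math.sin(theta)))
    print(f"b={b:2d}  M=16  max error={err:.2e}")
```

For a short word length the truncation error dominates, while for a long word length the remaining error is set by the angle approximation, which is why b and M have to be chosen together.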
Hardware Synthesis
The most important part of VLSI design is optimization in terms of speed, power, resource utilization, delay, etc. The proposed architecture was synthesized for a Spartan-3 xc3s50pq208-5 FPGA device using Xilinx ISE 10.1 and simulated in ModelSim (see Fig. 4). The total quiescent power consumed by the design was 0.096 W.
Conclusion
In this paper, a pipelined CORDIC architecture has been proposed for generating digital sine and cosine waves to support modulation and demodulation in various DSP modules. Compared with other techniques, the advantage of the architecture is that the internal critical path is equal to a single adder, and as a result a high throughput is maintained. To enhance the convergence rate and at the same time minimize the angle approximation error, the number of micro-rotations has been adjusted. To reduce the total quantization error, including the scale factor error, the pipelined CORDIC architecture has been optimized, which allows maximum quantization accuracy within the permitted word length. The inherent issue of overflow has been resolved. The CORDIC has been implemented as a pipeline to avoid iterative cycles, so that an output can be presented on each clock cycle.
