Filtering is one of the most important steps in spectral analysis, and window functions are generally used to design the filters. In this paper, we modify the existing architecture for realizing window functions using a CORDIC processor. First, we modify the conventional CORDIC algorithm to reduce its latency and area. The proposed CORDIC algorithm is completely scale-free over a range of convergence that spans the entire coordinate space. Second, we realize the window functions using a single CORDIC processor, as against two serially connected CORDIC processors in the existing technique, thus optimizing the design for area and latency. The linear CORDIC processor is replaced by a shift-add network, which drastically reduces the number of pipelining stages required in the existing design. On average, the proposed design requires approximately 64% fewer pipeline stages and saves up to 44.2% area. Currently, the processor is designed to implement the Blackman windowing architecture, which with slight modifications can be extended to other window functions as well. The details of the proposed architecture are discussed in the paper.
Introduction
Window filtering techniques [1, 2] are commonly employed in signal processing to limit time and frequency resolution. Various window functions have been developed to suit different requirements for side-lobe minimization, dynamic range, and so forth. Many hardware-efficient architectures are available for realizing the FFT [3] [4] [5], but the same is not true for windowing architectures. The conventional hardware implementation of window functions uses lookup tables, which incur growing area and time complexity as the word length increases. Moreover, they do not allow user-defined variations in the window length. An efficient implementation of flexible and reconfigurable window functions using the CORDIC algorithm is suggested in [6, 7]. Though these designs allow user-defined variations in window length, latency is a major problem. The CORDIC algorithm [8] [9] [10] inherently suffers from latency issues, and connecting two CORDIC processors in series, as is done in [6, 7], further increases the overall latency of the system.
In this paper, a new area-time efficient FPGA implementation to realize the Blackman window function is suggested.
We first redesign the conventional CORDIC algorithm to eliminate the scale-factor compensation network and optimize its micro-rotation sequence identification. We then replace the linear CORDIC processor used in the existing design with a shift-add tree derived using Booth multiplication. These modifications reduce the area consumption of the window architecture and decrease the number of pipeline stages.
The rest of the paper is structured as follows. Section 2 gives an overview of various window functions and the conventional CORDIC algorithm. In Section 3, we propose a new CORDIC algorithm, the redesigned-scale-free CORDIC. Section 4 deals with the architecture for implementing the window functions. Section 5 presents the FPGA implementation and complexity issues, while Section 6 concludes the paper.
Background

Window Filtering Techniques.
Window filtering is a well-known processing technique for limiting a signal to a short-time segment in various fields, such as audio or video signal processing, communication systems, and so forth. The rectangular, Gaussian, Hamming, Hanning, Blackman-Harris, and Kaiser windows are some of the most common windowing techniques [2, 11]. The selection among the available windows is based on the spectral characteristics desired by the application. Equations (1a)-(1c) define the Hanning, Hamming, and Blackman window families as follows:
$$ w_{\mathrm{Hanning}}(n) = 0.5\left[1 - \cos\left(\frac{2\pi n}{N-1}\right)\right], \quad 0 \le n \le N-1, \tag{1a} $$
where N is the window length.
$$ w_{\mathrm{Hamming}}(n) = \alpha - \beta\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1, \tag{1b} $$
where the values of α and β are determined to achieve maximum side-lobe cancellation. For the Hamming window, the coefficients are calculated as α = 25/46 and β = 21/46.
$$ w_{\mathrm{Blackman}}(n) = \alpha_0 - \alpha_1\cos\left(\frac{2\pi n}{N-1}\right) + \alpha_2\cos\left(\frac{4\pi n}{N-1}\right), \quad 0 \le n \le N-1. \tag{1c} $$
The Blackman-Harris window has three degrees of freedom, which can be used to design a family of window functions having different window amplitudes, roll-off rates, and side-lobe rejections. The Blackman window with coefficients $\alpha_0 = 0.42$, $\alpha_1 = 0.5$, and $\alpha_2 = 0.08$ has a side-lobe roll-off rate of 18 dB/octave and a worst-case side-lobe level of about 58 dB, while with coefficients $\alpha_0 = 0.4243801$, $\alpha_1 = 0.4973406$, and $\alpha_2 = 0.0782793$ the side-lobe level is 71.48 dB with a side-lobe roll-off rate of 6 dB/octave.
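As an illustration of the three-term Blackman family, the following minimal Python sketch evaluates the window in floating point. It assumes the usual cosine-sum form with the angle 2πn/(N − 1); the function name `blackman` and its parameter names are ours, and the sketch is a software reference only, not the hardware datapath.

```python
import math

def blackman(N, a0=0.42, a1=0.5, a2=0.08):
    """Three-term Blackman window of length N (N >= 2), assumed cosine-sum form."""
    w = []
    for n in range(N):
        theta = 2.0 * math.pi * n / (N - 1)   # angle of the first cosine term
        w.append(a0 - a1 * math.cos(theta) + a2 * math.cos(2.0 * theta))
    return w

w = blackman(33)
print(w[0], w[16])   # end point and centre of the window
```

Passing `a0=0.4243801, a1=0.4973406, a2=0.0782793` gives the second coefficient set quoted above.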
The hardware implementation of window functions involves trigonometric computations. The primitive technique for computing trigonometric functions uses LUTs, but this approach fails to support user-defined changes in the window length. Another popular algorithm for computing trigonometric functions is the CORDIC (coordinate rotation digital computer) algorithm. It is used in [6, 7] for efficient window implementation in hardware and to provide application-dependent changes in the window length. That design uses two serially connected CORDIC processors operating in different modes, one linear and the other circular. Because the CORDIC algorithm inherently suffers from latency issues and the two processors operate in series, latency is the major drawback of the existing designs in [6, 7]. Therefore, we redesign the CORDIC algorithm to minimize the number of iterations and hence reduce latency. Moreover, we replace one of the CORDIC processors with a Booth-multiplication shift-add tree to further minimize latency and area.
CORDIC Algorithm.
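The conventional CORDIC algorithm [9, 10, 12, 13] is based on the plane rotation of a vector through an angle θ:
$$ \begin{bmatrix} x_{out} \\ y_{out} \end{bmatrix} = R_p \begin{bmatrix} x_{in} \\ y_{in} \end{bmatrix}, \qquad R_p = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}. \tag{2} $$
Equation (2) forms the basic principle for iterative CORDIC coordinate calculations [8]. The key concept for realizing rotations using the CORDIC algorithm is to express the desired rotation angle θ as an aggregation of predefined elementary angles, defined as:
$$ \theta = \sum_{i=0}^{b-1} \alpha_i, \qquad \alpha_i = \tan^{-1}\left(2^{-i}\right), \tag{3} $$
and b is the word length. The rotation matrix $R_p$ in its original form (2) requires determining the sine and cosine values and four multiplication operations. Factoring out the cosine term simplifies the rotation matrix (4) by converting the multiplication operations to shifts, as the tangents of the elementary angles are defined as negative powers of two (3):
$$ R_p = \cos\alpha_i \begin{bmatrix} 1 & -2^{-i} \\ 2^{-i} & 1 \end{bmatrix}. \tag{4} $$
The rotation matrix $R_p$ in (4) is applicable for anticlockwise vector rotations. To support both clockwise and anticlockwise CORDIC rotations, the rotation matrix is altered as
$$ R_p = \cos\alpha_i \begin{bmatrix} 1 & -\mu_i 2^{-i} \\ \mu_i 2^{-i} & 1 \end{bmatrix}, \tag{5} $$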
where $\mu_i = 1$ for anticlockwise rotations and $\mu_i = -1$ for clockwise rotations. In its original form, the CORDIC algorithm suffers from major disadvantages such as scale-factor compensation, latency, and optimal identification of micro-rotations. We propose a redesigned-scale-free CORDIC algorithm to overcome these disadvantages.
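For reference, the conventional circular-rotation iteration can be sketched in Python as below. This is the textbook form in floating point, with multiplications by 2^-i standing in for hardware shifts; the function name `cordic_rotate` and the explicit final division by the accumulated gain are illustrative choices of ours, not the implementation of [7] or [8].

```python
import math

def cordic_rotate(x, y, theta, b=16):
    """Rotate (x, y) by theta using b conventional CORDIC micro-rotations."""
    z = theta                        # residual angle still to be rotated
    gain = 1.0                       # accumulated scale factor
    for i in range(b):
        mu = 1 if z >= 0 else -1     # direction of the i-th micro-rotation
        x, y = x - mu * y * 2.0 ** -i, y + mu * x * 2.0 ** -i
        z -= mu * math.atan(2.0 ** -i)   # elementary angle, normally read from a ROM
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return x / gain, y / gain        # scale-factor compensation

cx, cy = cordic_rotate(1.0, 0.0, 0.5)
print(cx, cy)
```

The sketch makes the three costs named above visible: every one of the b elementary angles is visited, the arctangent values have to be stored, and the result has to be divided by the accumulated gain.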
Redesigned-Scale-Free CORDIC Algorithm
The proposed CORDIC algorithm is an improved version of the conventional CORDIC algorithm in circular-rotation mode. The major ideas that lead to the proposed CORDIC algorithm are as follows: (i) redefine the elementary angles to eliminate the ROM required in the conventional CORDIC algorithm to store them, (ii) extend the Taylor series approximation of scaling-free CORDIC [13] to provide a completely scale-free solution over the entire coordinate space, and (iii) obviate the redundant CORDIC iterations using a new micro-rotation sequence identification. The existing scaling-free CORDIC [13] is outperformed by the conventional CORDIC for implementations beyond 20 bits. However, since an extensive set of applications works with word lengths up to 16 bits, our aim is to redesign the scaling-free CORDIC for word lengths up to 16 bits.
Redefining the Elementary Angles.
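We redefine the elementary angles used in conventional CORDIC (3) as
$$ \alpha_{r_i} = 2^{-r_i}, \tag{6} $$
where $r_i$ is the shift index used in the ith CORDIC iteration.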
The above definition of elementary angles obviates the ROM required by the conventional CORDIC algorithm to store the elementary angles.
Coordinate Calculation Equations.
We derive a new set of coordinate calculation equations by modifying the Taylor series expansion of the sine and cosine functions used in scaling-free CORDIC [13]. Instead of the second order approximation of scaling-free CORDIC, we use the third order Taylor series approximation. It is necessary to analyze various orders of approximation before the third order is finalized for use in the coordinate equations. We compare the mean square errors in the x-coordinate and y-coordinate for various orders of approximation in Table 1. The errors are calculated from the results obtained after simulating the CORDIC processors. The rotation matrix of the CORDIC processors was designed using the orders of approximation mentioned in Table 1 in $R_p$ (2) and given by:
The errors are calculated for 16 bit word length, for angles lying in the range [0, π/4], since for sine/cosine functions this range can be easily extended over the entire coordinate space using the octant wave symmetry. From Table 1, we conclude that the errors are of the same order for the various orders of approximation of the Taylor series expansion. Therefore, to keep the hardware complexity to a minimum, we choose the third order of approximation. Thus, the rotation matrix of the proposed CORDIC algorithm is given by
$$ R_p = \begin{bmatrix} 1 - 2^{-(2r_i+1)} & -\left(2^{-r_i} - \dfrac{2^{-3r_i}}{3!}\right) \\ 2^{-r_i} - \dfrac{2^{-3r_i}}{3!} & 1 - 2^{-(2r_i+1)} \end{bmatrix}. \tag{8} $$
In order to implement the above rotation matrix using shifts and additions only, we approximate $3!$ by $2^3$. With this approximation, the mean square errors in the x-coordinate and y-coordinate are $1.5839 \times 10^{-7}$ and $2.7664 \times 10^{-7}$, respectively, calculated under the same conditions as before (16 bit word length, angles in the range [0, π/4]). As these errors are of the same order as the errors in Table 1, this approximation does not affect the accuracy. Finally, the rotation matrix of the proposed CORDIC algorithm is defined as
$$ R_p = \begin{bmatrix} 1 - 2^{-(2r_i+1)} & -\left(2^{-r_i} - 2^{-(3r_i+3)}\right) \\ 2^{-r_i} - 2^{-(3r_i+3)} & 1 - 2^{-(2r_i+1)} \end{bmatrix}. \tag{9} $$
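A single scale-free micro-rotation can be sketched in Python as follows. The sketch assumes the third-order Taylor terms with 3! replaced by 2^3, so that cos α is taken as 1 − 2^-(2r+1) and sin α as 2^-r − 2^-(3r+3) for α = 2^-r; the function name `scale_free_step` is ours, and floating-point powers of two stand in for the hardware shifts.

```python
import math

def scale_free_step(x, y, r):
    """One anticlockwise micro-rotation by about 2**-r rad (assumed form)."""
    c = 1.0 - 2.0 ** -(2 * r + 1)          # cos(a) ~ 1 - a^2/2!
    s = 2.0 ** -r - 2.0 ** -(3 * r + 3)    # sin(a) ~ a - a^3/2^3, with 3! replaced by 2^3
    return c * x - s * y, s * x + c * y

x, y = 1.0, 0.0
for r in (2, 2):      # two micro-rotations of 0.25 rad, about 0.5 rad in total
    x, y = scale_free_step(x, y, r)

print(x, math.cos(0.5))
print(y, math.sin(0.5))
print(math.hypot(x, y))   # stays close to 1, so no scale-factor compensation is applied
```

No gain correction appears anywhere in the loop, which is the point of the scale-free formulation: the vector length is preserved to within the truncation error of the series.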
Determination of Highest Elementary Angle.
The use of Taylor series approximation imposes a restriction on the highest elementary angle that can be used in CORDIC iterations [13]. This restriction ensures that the higher order terms neglected by the chosen order of approximation do not affect the accuracy of the processor. For the third order of approximation, the fourth and subsequent higher order terms should be zero after the shift operation of CORDIC so that their role in the mathematical operations is obviated. For a word length of N bits, the nth order term $T_n$ is zero if it is right-shifted by r bits, defined as
For third order of approximation, n = 4, the smallest value of $r_{\min}$ and the highest permissible elementary angle are given by:
Thus, for 16 bit word length, $r_{\min} = 2$ and the highest elementary angle permissible is $\alpha_{\max} = 0.25$ radians.
Micro-Rotation Sequence Identification.
The proposed micro-rotation sequence generation differs from the conventional CORDIC micro-rotation identification in three ways. First, in the conventional technique each elementary angle is used only once, while we allow multiple micro-rotations corresponding to the same elementary angle. Second, the conventional CORDIC must use every elementary angle, whereas we select only those micro-rotations that the angle of rotation requires. Third, we restrict the micro-rotations to a single direction (anticlockwise), as against the bidirectional micro-rotations (clockwise and anticlockwise) of the conventional CORDIC.
The micro-rotation sequence generator selects the appropriate elementary angle for the current CORDIC iteration. Using the redefined elementary angles (6), the micro-rotations can be identified using the circuit shown in Figure 1. It comprises a priority encoder and reset circuitry. The input to the micro-rotation sequence generator is the rotation angle $\theta[N-1:0]$, where N is the word length. The priorities of the encoder are assigned in reverse order, with $\theta_{N-1}$ having the highest priority and $\theta_0$ the lowest. The reset circuitry resets a bit of the input rotation angle to generate the residue angle for the next CORDIC iteration. Since the micro-rotation sequence generator produces the shift index $r_i$ for one CORDIC iteration, it is required in every stage of the CORDIC pipeline (the implementation of a CORDIC stage is discussed in the forthcoming sections). The micro-rotation sequence generation block handles angles in the range [0, π/4]. This range can be extended to the entire coordinate space using the octant symmetry of the sine and cosine functions [14].
where $r_i \neq r_{\min}$ and $\alpha_{r_i} = 2^{-r_i}$. The maximum angle that can be handled by the micro-rotation sequence generator is π/4 ≈ 0.785 radians. Therefore, no more than 3 iterations of the highest elementary angle ($\alpha_{\max} = 2^{-2} = 0.25$ radians) are required; that is, a maximum of $n_1 = 3$ iterations is required to realize any angle of rotation in the range [0, π/4]. The remaining $n_2$ iterations determine the accuracy. To select an appropriate value of $n_2$, we simulate the CORDIC processor for varying $n_2$; the mean square error is tabulated in Table 2. The errors in Table 2 for $n_2 = 4$ and $n_2 = 5$ are of the same order. Therefore, to minimize the number of CORDIC iterations, we select $n_2 = 4$. We require a maximum of $n_1 + n_2 = 3 + 4 = 7$ iterations for the proposed CORDIC processor.
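The selection of shift indices can be modelled in software as in the Python sketch below. It is a behavioural stand-in for the priority encoder and reset circuitry, not a description of the circuit in Figure 1. It assumes that the angle is held as an unsigned 16 bit fraction of a radian (the kth bit from the MSB weighs 2^-k), and that a set bit heavier than α_max is consumed by repeated rotations of α_max; the names `WORD`, `R_MIN`, `MAX_ITERS`, and `microrotations` are ours.

```python
WORD = 16       # word length N; 0x10000 would represent 1.0 rad
R_MIN = 2       # minimum shift index, alpha_max = 2**-2 = 0.25 rad
MAX_ITERS = 7   # n1 + n2 = 3 + 4

def microrotations(theta_bits):
    """Return (shift indices, residue) for a WORD-bit angle in [0, pi/4]."""
    shifts, residue = [], theta_bits
    for _ in range(MAX_ITERS):
        if residue == 0:
            break
        r = WORD - residue.bit_length() + 1   # highest set bit has weight 2**-r
        r = max(r, R_MIN)                     # weights above alpha_max reuse alpha_max
        residue -= 1 << (WORD - r)            # remove 2**-r rad from the residue
        shifts.append(r)
    return shifts, residue

print(microrotations(0xC000))   # 0.75 rad: three rotations of alpha_max
print(microrotations(0xBFFF))   # all seven iterations are used and a residue remains
```

For the input `0xBFFF` the model leaves a residue of `0x01FF`, that is, 2^-7 − 2^-16 radians.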
Error Analysis.
The error analysis of the proposed CORDIC algorithm is divided into two parts: (i) residue angle error and (ii) error in the coordinate values.
Residue Angle Error.
In the proposed methodology, the desired angle of rotation is expressed as
where $r_{\min}$ is the minimum shift index (11) and N is the word length. We identify the micro-rotations by using the bit representation of the desired rotation angle. The residue angle error depends on the number of bits set in the radix-2 representation of the rotation angle and varies for different rotation angles. Therefore, we derive the worst-case angle error in the range of convergence [0, π/4].
The maximum number of iterations is fixed for all rotation angles. An input rotation angle with the MSB-nibble value 4'b1011 requires four iterations of $r_{\min} = 2$, while three or fewer iterations are required for other MSB-nibble values. From the second MSB nibble onwards, each bit set to 1'b1 in the radix-2 representation of the rotation angle requires one iteration; therefore, a maximum of four iterations is required if the second MSB-nibble value is 4'b1111. Since the iteration count is seven, the worst-case error is $(2^{-7} - 2^{-16})$. This worst-case residue angle error is specific to the rotation angle 16'b1011 1111 1111 1111, while for other rotation angles the residue angle error will be less. In the proposed 16 bit fixed-point representation scheme, 16'b1011 1111 1111 1111 is 42.97°; the worst-case residue angle error is 0.4467°.
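The two figures in degrees can be checked by direct conversion, assuming the angle word is read as an unsigned binary fraction of a radian:

```latex
\begin{align*}
\text{16'b1011 1111 1111 1111} &= 0.75 - 2^{-16} \approx 0.749985\ \text{rad}
  \approx 0.749985 \times \frac{180^\circ}{\pi} \approx 42.97^\circ,\\
2^{-7} - 2^{-16} &= 0.0078125 - 0.0000153 \approx 0.0077972\ \text{rad}
  \approx 0.0077972 \times \frac{180^\circ}{\pi} \approx 0.4467^\circ.
\end{align*}
```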
Error in Coordinate Values.
For fixed-point implementation, the error is represented in terms of bit-error position (BEP). The BEP in the x and y coordinates calculated using the proposed CORDIC processor is shown in Figure 2. For a BEP of n, the conventional CORDIC requires a word length of $n + \log_2 n + 2$ bits [15]. For a BEP of 10 bits, as achieved by the proposed CORDIC algorithm, the conventional CORDIC will require a 16 bit word length. We, therefore, compare the proposed design with the existing design using the conventional CORDIC processor [7] for 16 bits.
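The 16 bit figure follows from the word-length expression with n = 10, rounding up to a whole number of bits:

```latex
\[
n + \log_2 n + 2 \;=\; 10 + \log_2 10 + 2 \;\approx\; 10 + 3.32 + 2 \;=\; 15.32
\quad\Longrightarrow\quad \lceil 15.32 \rceil = 16 \text{ bits}.
\]
```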
Architecture for Implementing Window Functions
In this section, we focus on implementing the pipelined architecture to generate window functions. The length of the window function is selected by the user at run time. Currently, the architecture implements the Blackman window, but with slight modifications it can be extended to other window functions as well. In the proposed architecture, the output bit width is set to 16 bits. Figure 3 shows the block diagram for generating the Blackman window function. The circuit consists of a theta generator unit (TGU), a window coefficient multiplier (WCM), a circular CORDIC processor (CCP), and a FIFO. The TGU generates the two angle values, θ = 2πn/(N − 1) and 2θ = 4πn/(N − 1), required in the three-term Blackman window function. The WCM multiplies the input signal samples with the window constants using a shift-add tree derived from the Booth multiplication algorithm. The CCP generates the cosine terms in the window function. The FIFO provides proper synchronization between the window coefficients having cosine terms and the constants.
Circular CORDIC Processor (CCP).
The CCP is a pipelined implementation of the proposed redesigned-scale-free CORDIC algorithm discussed in Section 3. A total of seven ($n_1 = 3$ and $n_2 = 4$) iterations are required (as discussed in Section 3.5). Since each pipeline stage performs one iteration, the proposed CCP pipeline is seven stages long. Each stage (Figure 4) is a combination of three blocks: (i) the coordinate calculation unit, (ii) the shift-index calculation, and (iii) the micro-rotation sequence generation. The coordinate calculation unit implements (9) using shifts and additions. The shift-index calculation computes the necessary shifts, $(2r_i + 1)$ and $(3r_i + 3)$, required by the coordinate calculation unit. The micro-rotation sequence generation is shown in Figure 1.
The complexity of the coordinate calculation unit is equal to six N bit logic shifters and six N bit adder/subtractors. The shift-index calculation unit requires three $(\log_2 N + P)$-bit adders, where P is the number of extra bits required to store the sum. Even though the coordinate calculation unit of the proposed redesigned-scale-free CORDIC is more complex than that of the conventional CORDIC [8], the overall gate count of the proposed window architecture using the proposed CCP pipeline is reduced.
Window Coefficient Multiplier (WCM).
The WCM unit multiplies the input samples with the Blackman window coefficients ($\alpha_0$, $\alpha_1$, and $\alpha_2$). The shift-add tree for multiplication with $\alpha_0$, $\alpha_1$, and $\alpha_2$ is derived using the Booth multiplication algorithm. In the radix-2 representation system, multiplication by 0.5 is equivalent to a single right shift. Therefore, multiplication with $\alpha_1 = 0.5$ is realized using a hardwired shifter. The coefficient $\alpha_0$ is represented in 16 bit fixed-point format as 0001 1011 0010 0000, which requires four 16 bit adders and five hardwired shifters, while $\alpha_2$ is represented as 0000 0101 0000 0010 and requires two 16 bit adders and three hardwired shifters. The complexity of the WCM unit is equivalent to six 16 bit adders, as hardwired shifters do not incur any hardware cost.
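The adder and shifter counts can be reproduced with a small Python model of constant multiplication by shifts and additions. Two assumptions are ours: the binary point is placed so that bit 14 weighs 1.0 (inferred from the two bit patterns, which then evaluate to about 0.4238 and 0.0782), and the model uses the plain binary digits of each constant rather than a Booth recoding. The names `FRAC_BITS`, `shift_add_multiply`, and `adders_needed` are hypothetical.

```python
FRAC_BITS = 14   # assumed binary point: bit 14 weighs 1.0

ALPHA0 = 0b0001_1011_0010_0000   # pattern quoted for alpha_0
ALPHA2 = 0b0000_0101_0000_0010   # pattern quoted for alpha_2

def shift_add_multiply(x, pattern):
    """Multiply integer sample x by the constant using right shifts and additions only."""
    acc = 0
    for bit in range(FRAC_BITS + 1):
        if (pattern >> bit) & 1:
            acc += x >> (FRAC_BITS - bit)   # one hardwired shift feeding one adder input
    return acc

def adders_needed(pattern):
    """k shifted partial products are summed by k - 1 adders."""
    return bin(pattern).count("1") - 1

one = 1 << FRAC_BITS                       # the value 1.0 in the assumed format
print(shift_add_multiply(one, ALPHA0) / one, adders_needed(ALPHA0))
print(shift_add_multiply(one, ALPHA2) / one, adders_needed(ALPHA2))
```

Under these assumptions the model needs four adders for the first pattern and two for the second, six in total, with the multiplication by 0.5 contributing only a shift.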
Theta Generator Unit (TGU).
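The TGU generates the two angles given by
$$ \theta = \frac{2\pi n}{N-1}, \qquad 2\theta = \frac{4\pi n}{N-1}, $$
where N is a power of 2 such that $N = 2^M$. The difference between consecutive values of θ is given by
$$ \Delta\theta = \frac{2\pi}{N-1}. $$
For $N = 2^M$,
$$ \Delta\theta = \frac{2\pi}{2^M - 1} = 2\pi \cdot 2^{-M}\left(1 - 2^{-M}\right)^{-1}. \tag{15c} $$
Using the binomial theorem (B.T.), we simplify (15c) to the following:
$$ \Delta\theta = 2\pi\left(2^{-M} + 2^{-2M} + 2^{-3M} + 2^{-4M} + \cdots\right) = \Delta\theta_1 + \Delta\theta_2 + \Delta\theta_3 + \Delta\theta_4 + \cdots, \qquad \Delta\theta_k = 2\pi \cdot 2^{-kM}. $$
Generally, most signal processing applications use at least a 16-point DFT, which implies N ≥ 16 and M ≥ 4. Therefore, only three terms of the binomial expansion are sufficient for 16 bit accuracy, as follows:
$$ \Delta\theta \approx \Delta\theta_1 + \Delta\theta_2 + \Delta\theta_3 = 2\pi\left(2^{-M} + 2^{-2M} + 2^{-3M}\right). $$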
For 16 bit word length and M ≥ 4, the term $\Delta\theta_4$ always gets a right shift greater than or equal to 16. Therefore, $\Delta\theta_4$ is zero for 16 bit word length. Figure 5 shows the block diagram representation of the TGU. The angles in the windowing function are uniformly distributed over the entire coordinate space, whereas the CCP unit handles angles in the range [0, π/4]. Therefore, the TGU divides the entire coordinate space into octants, so that the input angle to the CCP always lies in the range [0, π/4]. The octants are distinguished as shown in Figure 6; the TGU also generates signals for proper octant mapping of the values generated by the CCP.
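The truncation of the binomial series can be checked numerically with the Python sketch below. It assumes the angle step is 2π/(2^M − 1), expanded as 2π(2^-M + 2^-2M + 2^-3M + ...) and cut after three terms; the function name `delta_theta_approx` is ours, and the error is reported relative to 2π so that it can be compared with one 16 bit LSB.

```python
import math

def delta_theta_approx(M):
    """Three-term shift-add approximation of 2*pi / (2**M - 1)."""
    return 2.0 * math.pi * (2.0 ** -M + 2.0 ** (-2 * M) + 2.0 ** (-3 * M))

errors = {}
for M in range(4, 11):                     # window lengths N = 16 ... 1024
    exact = 2.0 * math.pi / (2 ** M - 1)
    errors[M] = abs(exact - delta_theta_approx(M)) / (2.0 * math.pi)
    print(M, errors[M], errors[M] / 2.0 ** -16)   # error in units of one 16 bit LSB
```

The worst case is the shortest window, M = 4, where the neglected terms sum to 2^-16/(1 − 2^-4) of a full turn, slightly more than one LSB; the error shrinks rapidly as M grows.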
The TGU requires three 16 bit adders, two barrel-shifters and one encoder.
Window Generation.
In Figure 7, we compare the Blackman window generated using the proposed processor with that of the MATLAB built-in function blackman() for N = 32.
FPGA Implementation and Complexity Issues
The proposed architecture is coded in Verilog and simulated and synthesized using Xilinx ISE 9.2i Design Suite to be mapped on Xilinx Virtex 2Pro (XC2VP50-6FF1148) device. For 16 bit implementation, the proposed design consumes 1800 slices and 3371 4-Input LUTs, with a maximum operating frequency of 101.284 MHz. The total delay of 9.873 nsec is distributed as 58.7% logic delay and 41.3% route delay. The total gate count of the proposed design is 34739.
Comparison with Existing Architecture.
The CORDIC processors, both linear and circular, used in [7] are designed using the conventional CORDIC algorithm. The scaling-free CORDIC [13] and enhanced scaling-free CORDIC [16] are currently the best available hardware designs for circular CORDIC implementation. We compare our processor with three designs: (i) the existing design in [7] using conventional circular CORDIC, (ii) the design in [7] with the conventional circular CORDIC replaced by scaling-free CORDIC [13], and (iii) the design in [7] with the conventional circular CORDIC replaced by enhanced scaling-free CORDIC [16]. The area complexity and latency of the proposed design and of these three variants of the existing design [7] are compared in Table 3.
Area Comparison.
The area of the conventional circular CORDIC processor is calculated using Xilinx CORDIC IP v3.0. The Xilinx CORDIC core is optimized for circular CORDIC computation with maximum pipelining for 16 bit word length; its gate count is 20122. In [13], the complexity of the 16 bit scaling-free CORDIC is computed to be equivalent to 1000 1-bit full adders and 597 1-bit registers, which corresponds to approximately 16776 gates. The SFB4C architecture of the enhanced scaling-free (ESF) CORDIC [16] replaces the initial four scaling-free CORDIC iterations with conventional CORDIC iterations. Thus, the complexity of the 16 bit ESF CORDIC without scale-factor compensation is equivalent to 512 1-bit full adders and 420 1-bit registers, approximately equal to 9504 gates. The complexity of the 16 bit linear conventional CORDIC is equivalent to 512 1-bit full adders and 768 1-bit registers, approximately equal to 12288 gates. The other units, such as the theta generator unit, FIFO, and adders required for realizing the window processor, are common to the proposed and the existing designs.
Notes to Table 3: (1) Latency is defined in terms of the number of pipelining stages required by the design. (2) The gate count of the proposed design is 34739. (3) The latency of the proposed design is 10.
Latency.
The throughput of all the designs is the same, that is, one data sample per clock cycle. The latency differs, and when the designs operate at the same clock frequency it is closely related to the number of iterations in the circular and linear CORDIC processors. The 16 bit linear CORDIC processor uses a 16-stage pipeline. The conventional circular CORDIC processor also uses a 16-stage pipeline for 16 bit word length. For the same word length, the scaling-free CORDIC [13] processor uses a 12-stage pipeline, while the ESF CORDIC [16] pipeline is 9 stages long. Therefore, the latency of the existing design in [7] is 32 stages with conventional circular CORDIC, 28 stages with scaling-free CORDIC, and 25 stages with ESF CORDIC.
The new redesigned-CORDIC pipeline is 7 stages long (Section 4.1). The delay of the WCM unit (Section 4.2) is three adders in series, which can be considered equivalent to three linear CORDIC iterations. Hence, the total latency of the proposed design is 10 stages, which is far lower than that of the existing design even when it uses the best of the available circular CORDIC hardware.
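The stage counts above can be tabulated with a few lines of Python. The per-variant reductions follow directly from the quoted pipeline lengths; averaging them over the three variants is our assumed reading of how an overall figure of roughly 64% would be obtained, and the variable names are hypothetical.

```python
LINEAR_CORDIC = 16                      # stages of the linear CORDIC in [7]
circular = {"conventional": 16, "scaling-free [13]": 12, "ESF [16]": 9}
proposed = 7 + 3                        # CCP pipeline plus WCM adder stages

savings = {}
for name, stages in circular.items():
    existing = LINEAR_CORDIC + stages   # two CORDIC processors in series
    savings[name] = 100.0 * (existing - proposed) / existing

average = sum(savings.values()) / len(savings)
print(savings)
print(round(average, 1))
```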
Delay.
The delay is the time required to generate one set of window coefficients for a window length of N when the design is operating at the maximum clock frequency. The critical path of the proposed design is the TGU. The existing designs using the conventional circular CORDIC and the scaling-free circular CORDIC use the same TGU, while the existing design using the ESF-CORDIC uses a slightly less complex TGU. The TGU for the proposed design and for the existing designs using the conventional and scaling-free circular CORDIC generates angles in the range [−π/4, π/4], while for the existing design using the ESF-CORDIC the TGU generates angles in the range [−π/2, π/2]. Therefore, the maximum clock frequency is 101.983 MHz for the existing design using ESF-CORDIC and 101.284 MHz for the other designs, including the proposed design. Figure 8 compares the delay of the four designs for various window lengths.
Conclusion
In this paper, we present an area-time efficient CORDIC-based processor for realizing window functions. Currently, the architecture implements the Blackman window function; with slight modification, it can be extended to other window functions as well. We also propose a circular CORDIC processor for word lengths up to 16 bits. The redesigned scale-free CORDIC processor uses the third order Taylor series approximation to realize scale-free CORDIC iterations. However, the removal of the scaling factor comes at the cost of more complex coordinate calculations. The micro-rotation sequence generation is optimized using a priority encoder, which reduces the total CORDIC processor pipeline to seven stages. A shift-add tree derived using the Booth multiplication algorithm replaces the linear CORDIC processor in the original design of the window architecture. The proposed Blackman window architecture saves approximately 44.2% area and drastically reduces latency with no effect on accuracy.
