SUMMARY In this work we consider optimized twiddle factor multipliers based on shift-and-add multiplication. We propose a low-complexity structure for twiddle factors with a resolution of 32 points. Furthermore, we propose a slightly modified version of a previously reported multiplier for a resolution of 16 points with lower round-off noise. For completeness we also include results on optimal coefficients for a resolution of eight points. We perform finite word length analysis for both coefficient and round-off errors and derive optimized coefficients with minimum complexity for varying requirements.
Introduction
Computation of the discrete Fourier transform (DFT) and inverse DFT is used in, e.g., orthogonal frequency-division multiplexing (OFDM) communication systems and spectrometers. An $N$-point DFT can be expressed as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1}$$

where $W_N = e^{-j2\pi/N}$ is the twiddle factor, the $N$:th primitive root of unity with its exponent being evaluated modulo $N$, $n$ is the time index, and $k$ is the frequency index. Various methods for efficiently computing (1) have been the subject of a large body of published literature. These methods are commonly referred to as fast Fourier transform (FFT) algorithms. Also, many different architectures for efficiently mapping the FFT algorithm to hardware have been proposed [1].
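As a point of reference for the definition above, the following minimal Python sketch evaluates the DFT sum directly, with the twiddle-factor exponent reduced modulo $N$. The function names `twiddle` and `dft` are chosen here for illustration only, and the direct $O(N^2)$ evaluation is not one of the FFT algorithms or architectures discussed in this paper.

```python
import cmath

def twiddle(N, m):
    """Return W_N^m = exp(-j*2*pi*m/N), with the exponent reduced modulo N."""
    return cmath.exp(-2j * cmath.pi * (m % N) / N)

def dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * W_N^(n*k)."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]

if __name__ == "__main__":
    # A unit impulse has a flat spectrum: every X(k) equals 1.
    print(dft([1, 0, 0, 0]))
```

Because the exponent is taken modulo $N$, only $N$ distinct twiddle factors ever occur, which is what makes dedicated constant multipliers attractive for small $N$.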
A commonly used architecture for transforms of length $N = b^r$ is the pipelined FFT [2]-[7]. The pipeline architecture is characterized by continuous processing of input data. In addition, the pipeline architecture is highly regular, making it straightforward to automatically generate FFTs of various lengths. Figure 1 outlines the architecture of a radix-$2^i$, single-path delay feedback (SDF), pipelined FFT for $N = 256$. This architecture is generic, while the required ranges of each complex twiddle factor multiplier for different algorithms are outlined in Table 1 [5]-[8].

Manuscript received July 07, 2010. † The authors are with the Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden. http://www.es.isy.liu.se/ a) E-mail: fahadq@isy.liu.se b) E-mail: oscarg@isy.liu.se
From now on, we denote a multiplier with a twiddle factor resolution of $N$ points around the unit circle a $W_N$-multiplier. For small ranges of the twiddle factor multipliers it is advantageous to use arithmetic circuits optimized for the required coefficients rather than general multipliers. A $W_4$-multiplier only performs multiplication by one of $\{1, j, -1, -j\}$, in practice $1$ or $-j$, and is most commonly realized by merging it into the subsequent butterfly (often denoted BFII, as opposed to the standard butterfly BFI, following [5]). For larger $N$ it is common to utilize the octave symmetry of the coefficients. This means that only twiddle factors corresponding to angles in the range $0 \le \alpha \le \pi/4$ need to be considered. This is equivalent to twiddle factors $W_N^m$ in the range $0 \le m \le N/8$. Multiplications for other values of $m$ can be obtained by optionally swapping outputs (symmetry around $\Re = \Im$ and $\Re = -\Im$) and negating one or both outputs (symmetry around $\Re = 0$ and $\Im = 0$). A complete complex multiplier based on complex constant multiplication is shown in Fig. 2. In previous work, the complex multipliers for $W_8$ and $W_{16}$ have been replaced with constant complex multipliers based on shift-and-add networks [10]-[12].
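The octave symmetry described above can be sketched in Python as follows. The function name `twiddle_by_octant` and the decomposition into a quadrant index and a mirrored remainder are assumptions made for this illustration; the sketch mirrors the swap and negate operations in software and is not a description of the datapath in Fig. 2.

```python
import cmath
import math

def twiddle_by_octant(N, m):
    """Build W_N^m = cos(a) - j*sin(a), a = 2*pi*m/N, from a first-octant angle only."""
    assert N % 8 == 0, "sketch assumes N is a multiple of 8"
    m %= N
    quadrant, r = divmod(m, N // 4)      # a = quadrant*pi/2 + 2*pi*r/N
    swap = r > N // 8                    # mirror around the line Re = Im
    if swap:
        r = N // 4 - r
    c = math.cos(2 * math.pi * r / N)    # only angles in 0..pi/4 are evaluated
    s = math.sin(2 * math.pi * r / N)
    if swap:
        c, s = s, c
    re, im = c, -s
    for _ in range(quadrant):            # multiplication by -j: (re, im) -> (im, -re)
        re, im = im, -re
    return complex(re, im)

if __name__ == "__main__":
    N = 32
    worst = max(abs(twiddle_by_octant(N, m) - cmath.exp(-2j * cmath.pi * m / N))
                for m in range(N))
    print("largest deviation from exp(-j*2*pi*m/N):", worst)
```

Only the trigonometric values for $0 \le m \le N/8$ are ever computed; every other twiddle factor is obtained by swapping and negating real and imaginary parts.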
In this work, we propose a low-complexity complex constant $W_{32}$-multiplier based on trigonometric identities. Furthermore, we revisit and slightly improve the $W_{16}$-multiplier proposed in [7] and, for completeness, calculate the coefficients and complexity of the $W_8$-multiplier. A preliminary version of this work was presented in [9]. In this extended version, we have included results on the finite word length properties. It is shown that the coefficient quantization can lead to larger errors than considered in [7, 9]. Furthermore, expressions for the round-off noise are derived. Based on this, a slight modification is proposed for the $W_{16}$-multiplier from [7]. The rest of the paper is arranged as follows. In the next section, the complex constant multipliers are introduced. In Section 3, coefficient quantization and data round-off errors are analyzed. In Section 4, results are presented, and finally, some conclusions are given in Section 5.
Complex Constant Multipliers
The constant multiplier can be realized with a minimum number of adders using the method in [14].
$W_{16}$-Multiplier
In [7], a $W_{16}$-multiplier based on the trigonometric identity

$$\sin 2\theta = 2 \sin\theta \cos\theta$$

was introduced. Hence, as $2\sin\frac{\pi}{8}\cos\frac{\pi}{8} = \sin\frac{\pi}{4}$, the required coefficients can be obtained as shown in Table 2. The resulting structure is shown in Fig. 3. Note that multiplication by two is equivalent to a left shift and, hence, is not considered a multiplication. The structure shown in Fig. 3 is slightly modified compared to that in [7]: two multiplexers are added at the output to allow multiplication by 1, and the constant coefficients are interchanged to reduce the round-off noise in the structure. Furthermore, it was suggested in [7] that the multipliers should be implemented based on the canonic signed-digit (CSD) representation. In the current work it is instead suggested to use minimum adder multipliers from [14].
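To make the notion of a shift-and-add constant multiplier concrete, the following Python sketch converts a quantized coefficient to CSD form and multiplies by it using only shifts, additions and subtractions. The helper names `csd_digits`, `shift_add_multiply` and `adder_count` are hypothetical. The sketch covers only the plain CSD baseline suggested in [7]; the minimum adder multipliers of [14] can need fewer adders and are not reproduced here.

```python
import math

def csd_digits(k):
    """CSD (non-adjacent form) digits of an integer k >= 0, LSB first, each in {-1, 0, 1}."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k & 3)   # +1 if k % 4 == 1, -1 if k % 4 == 3
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

def shift_add_multiply(x, k):
    """Multiply integer x by the constant k using only shifts, additions and subtractions."""
    acc = 0
    for shift, d in enumerate(csd_digits(k)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc

def adder_count(k):
    """Adders/subtractors of a plain CSD multiplier: nonzero digits minus one."""
    return max(sum(1 for d in csd_digits(k) if d) - 1, 0)

if __name__ == "__main__":
    frac_bits = 8
    k = math.floor(math.sin(math.pi / 4) * 2 ** frac_bits + 0.5)  # 181 for 8 fractional bits
    print(k, csd_digits(k), adder_count(k))
    print(shift_add_multiply(100, k), 100 * k)
```

For eight fractional bits the rounded value of $\sin\frac{\pi}{4}$ is $181 = 2^8 - 2^6 - 2^4 + 2^2 + 1$, i.e., five nonzero CSD digits and four adders in this baseline.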
$W_{32}$-Multiplier
For the $W_{32}$-multiplier, we propose to use a similar approach, i.e., based on trigonometric identities, identify a small number of constant multiplications that can be combined to form all the remaining coefficients. In our proposed approach these constant multiplications are $\sin$ […] Table 3. One possible structure for the resulting complex constant multiplier is illustrated in Fig. 4, with the corresponding control signals shown in Table 4. The proposed architecture is implemented using constant multiplications, multiplexers and adders (subtractors).
Finite word length error analysis
As the proposed structures are based on combinations of several multiplications it is of interest to consider the errors due to coefficient and data quantization. From an FFT point of view, the coefficient quantization will lead to a static deviation from the ideal DFT response, while the data quantization can be seen as a noise source affecting the data. Here, we will consider the absolute magnitude error of the coefficients.
Coefficient quantization error
We can represent the coefficient quantization error for coefficient $c$, with quantized value $c_q$, as $\Delta_c$, the difference between $c_q$ and $c$.
Now, if we use rounding with $B$ fractional bits, we know that $|\Delta_c| \le 2^{-(B+1)}$. However, given that we know the […]
where in (6) we consider only the first-order error terms. Summarizing these errors for the $W_{16}$-multiplier, we get the error expressions presented in Table 5. The actual errors using rounding for the partial coefficients are shown in Fig. 5, which shows the relative error in ulps†. It is clear from this figure that for 7 out of the 16 considered word lengths the magnitude of the error is larger than 0.5 ulp, which breaks the precision requirement of the $\sin\frac{\pi}{4}$ multiplication. Conventionally, the word length is increased by one or more bits to achieve the required precision. Specifically, for the above cases, increasing the word length by one bit for one or both of the partial coefficients is enough to meet the specification in all cases except those with 6 and 12 fractional bits. In these two cases, the word length must be increased by two bits to fulfill the precision requirement. However, the need for this should be considered on the system level by evaluating the effect of these additional quantization errors. Note that the magnitudes of the errors for the partial coefficients themselves are smaller than 0.5 ulp, which is expected as these are derived directly by rounding. Similarly, error expressions are computed for the different coefficients of the $W_{32}$-multiplier based on the architecture in Fig. 4. The error expressions of the $W_{32}$-multiplier are tabulated in Table 6. Based on these expressions, the error for varying word lengths is shown in Fig. 6 for those coefficients that use more than one constant multiplication. Figure 6 shows that, except for 20-bit resolution, at least one of the derived coefficients breaks the precision requirement.

† Unit of least position, i.e., the weight of the least significant bit of the representation. When using $B$ fractional bits, ulp $= 2^{-B}$.
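The effect discussed above can be explored numerically with a short Python sketch. It forms $\sin\frac{\pi}{4}$ as $2\sin\frac{\pi}{8}\cos\frac{\pi}{8}$ from two rounded partial coefficients and reports the errors in ulp, together with a first-order estimate. The function names and the sign convention $c_q = c + \Delta_c$ are assumptions of this sketch, and it is not meant to reproduce the exact expressions of Table 5 or the curves of Fig. 5.

```python
import math

def quantize(c, frac_bits):
    """Round c to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return math.floor(c * scale + 0.5) / scale

def w16_sin_pi4_errors(frac_bits):
    """Errors (in ulp) when sin(pi/4) is formed as 2*sin(pi/8)*cos(pi/8) from rounded factors."""
    s, c = math.sin(math.pi / 8), math.cos(math.pi / 8)
    sq, cq = quantize(s, frac_bits), quantize(c, frac_bits)
    ds, dc = sq - s, cq - c                  # assumed sign convention: c_q = c + delta_c
    ulp = 2.0 ** -frac_bits
    actual = 2 * sq * cq - math.sin(math.pi / 4)
    first_order = 2 * (s * dc + c * ds)      # the product of the two deltas is neglected
    return {"sin": ds / ulp, "cos": dc / ulp,
            "derived": actual / ulp, "first_order": first_order / ulp}

if __name__ == "__main__":
    for bits in range(4, 13):
        e = w16_sin_pi4_errors(bits)
        flag = "exceeds 0.5 ulp" if abs(e["derived"]) > 0.5 else ""
        print(bits, f'{e["derived"]:+.3f}', flag)
```

Under this convention the first-order error of the derived coefficient is $2(\sin\frac{\pi}{8}\,\Delta_{\cos} + \cos\frac{\pi}{8}\,\Delta_{\sin})$, whose magnitude can reach about 1.3 ulp even though each partial coefficient is within 0.5 ulp.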
Round-off noise
In fixed-point representation, it is infeasible to increase the word length after each intermediate multiplication stage; each product must be quantized to $W$ bits. Data quantization errors are often modeled as a random noise source with statistical properties determined by the quantization model and the word length. In the proposed architectures one will typically quantize the data after each partial multiplication, as shown in Figs. 7 and 8. […] The results are presented in Tables 7 and 8 for the $W_{16}$- and $W_{32}$-multipliers, respectively. In the original $W_{16}$-multiplier introduced in [7], the round-off noise term for the $\sin$ […]
Compared to the original, the proposed modified $W_{16}$-multiplier reduces the round-off noise by an amount corresponding to about one bit of data word length.
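The noise model referred to above can be illustrated with a generic Python sketch: two cascaded constant multiplications, each followed by rounding, are simulated and compared with the usual white-noise prediction in which every rounding contributes a variance of $q^2/12$ with $q = 2^{-W}$. All names, the uniformly distributed input, and the two-stage cascade are assumptions of this sketch; it does not reproduce the expressions of Tables 7 and 8 or the exact structures of Figs. 7 and 8.

```python
import math
import random

def round_to(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale

def cascade_noise_variance(first, second, frac_bits, samples=50000, seed=1):
    """Estimate the mean squared output error of y = round(round(x*first)*second)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = round_to(round_to(x * first, frac_bits) * second, frac_bits)
        acc += (y - x * first * second) ** 2
    return acc / samples

def predicted_variance(second, frac_bits):
    """White-noise model: each rounding adds q^2/12; the first one is scaled by `second`."""
    q2_12 = (2.0 ** (-2 * frac_bits)) / 12
    return q2_12 * (1 + second ** 2)

if __name__ == "__main__":
    s, c = math.sin(math.pi / 8), math.cos(math.pi / 8)
    for first, second in ((c, s), (s, c)):
        print(cascade_noise_variance(first, second, 10), predicted_variance(second, 10))
```

In this simplified model the error of the first rounding is scaled by the second coefficient, so the order of the two multiplications affects the output noise, which gives some intuition for why rearranging coefficients in a cascade can lower the round-off noise.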
Results
Coefficient quantization and optimized coefficient
As discussed previously, the $W_{16}$- and $W_{32}$-multipliers are composed of several constant multiplications, so the coefficient quantization errors of the individual multiplications are combined. While this may lead to cancellation of quantization errors having opposite signs, it may also cause the total coefficient quantization error to be larger than the individual coefficient quantization errors. The straightforward way of handling this is to increase the word lengths of the individual multiplications until the total error meets the specification. Addition aware quantization [13] provides a better way of obtaining this increase in coefficient accuracy. In [13], $E$ additional fractional bits are used, based on the observation that there are exactly $2^E$ different representable coefficients for which $\epsilon \le 2^{-(N+1)}$, including the one obtained by rounding to $N$ fractional bits. These $2^E$ combinations are searched for the best solution.
For each precision requirement, the solution with the smallest maximum quantization error among those solutions with the smallest addition count is selected. The resulting coefficient quantization errors of the $W_{16}$- and $W_{32}$-multipliers are shown in Figs. 9 and 10, respectively. In Fig. 9, it can be seen that all 7 of the 16 cases that broke the precision requirement in the rounded version now meet it. For the $W_{32}$-multiplier, the 15 out of 16 cases that broke the precision requirement are now within it.
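The search outlined above can be sketched for the $W_{16}$ case as follows. The candidate enumeration follows the description of [13] given in the text, but the rest is a set of assumptions made for illustration: the function names are hypothetical, the number of nonzero CSD digits is used as a stand-in for the addition count (the paper uses minimum adder multipliers [14]), and all candidate pairs are searched exhaustively.

```python
import math
from itertools import product

def candidates(c, frac_bits, extra_bits):
    """Integers k with |k / 2**(frac_bits+extra_bits) - c| <= 2**-(frac_bits+1)."""
    scaled = c * 2 ** (frac_bits + extra_bits)
    half = 2 ** (extra_bits - 1)                 # half an ulp, in units of the finer grid
    return list(range(math.ceil(scaled - half), math.floor(scaled + half) + 1))

def csd_nonzeros(k):
    """Number of nonzero digits in the canonic signed-digit form of k >= 0."""
    count = 0
    while k > 0:
        if k & 1:
            k -= 2 - (k & 3)
            count += 1
        k >>= 1
    return count

def search_w16(frac_bits, extra_bits):
    """Pick quantized sin(pi/8), cos(pi/8) so that 2*sq*cq is within 0.5 ulp of sin(pi/4)."""
    s, c = math.sin(math.pi / 8), math.cos(math.pi / 8)
    target = math.sin(math.pi / 4)
    scale = 2 ** (frac_bits + extra_bits)
    limit = 2.0 ** -(frac_bits + 1)
    best = None
    for ks, kc in product(candidates(s, frac_bits, extra_bits),
                          candidates(c, frac_bits, extra_bits)):
        sq, cq = ks / scale, kc / scale
        derived_err = abs(2 * sq * cq - target)
        if derived_err > limit:
            continue                              # derived coefficient breaks the requirement
        worst = max(derived_err, abs(sq - s), abs(cq - c))
        cost = csd_nonzeros(ks) + csd_nonzeros(kc)  # proxy for the addition count
        key = (cost, worst)                       # fewest additions first, then smallest error
        if best is None or key < best[0]:
            best = (key, ks, kc)
    return best

if __name__ == "__main__":
    print(search_w16(frac_bits=8, extra_bits=3))
```

The selection key orders solutions first by the cost proxy and then by the largest quantization error, mirroring the selection rule stated above.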
Fig. 10 Relative quantization errors for the coefficients in the $W_{32}$-multiplier using addition aware quantization.

For a $W_8$-multiplier implemented with a constant coefficient, the optimal coefficients are tabulated, together with the range of fractional bits and the number of correct bits, in Table 9.
Corresponding results for the $W_{16}$- and $W_{32}$-multipliers are tabulated in Tables 10 and 11, respectively. A comparison of the hardware resources of the straightforward approach and the addition aware method, in terms of the required number of additions, is shown in Figs. 11 and 12 for the $W_{16}$- and $W_{32}$-multipliers, respectively. In rare cases the addition aware method can even decrease the number of additions, as can be seen for eight bits.
Comparison with previous method
Here, a comparison with the previously proposed methods in [10, 11] is presented. The reduced Booth-like multipliers in [11] are based on the observation that when the set of coefficients is known, the Booth-encoding logic can be simplified, as well as the partial product accumulation tree. Here, we have assumed that four multiplexers are required for each non-zero position in the accumulation tree. In practice, this number will be higher for some positions. Using multiple constant multiplication (MCM) and a multiplexer to select the correct coefficient was proposed in [10]. For the results presented here, the algorithm in [16] is used, which in general should provide better results compared to the algorithm used in [10].

a Using minimum adder multipliers from [14]. b Using CSD-multipliers as proposed in [7].
The complexity results of the $W_{16}$-multiplier are shown in Table 12 for a varying number of fractional bits. It is clear that using minimum adder multipliers [14] is better than using CSD multipliers, which is not surprising since CSD multipliers are a subset of minimum adder multipliers. Compared to the reduced Booth-like multipliers, the considered multipliers always have a lower complexity, both in terms of adders and multiplexers. Finally, the MCM approach is as good as or slightly worse than the complex constant multiplier in Fig. 3. For the $W_{32}$-multipliers, the results are shown in Table 13. Here it can be seen that the adder complexity is typically slightly smaller for the reduced Booth multipliers proposed in [11] compared to the proposed complex constant multiplier in Fig. 4. However, the number of multiplexers is higher in all cases, and in most technologies this should mean that the proposed complex constant multiplier has a lower total complexity. Compared to the MCM approach, the proposed multiplier has as few or fewer adders, except for the case with nine fractional bits. The advantage of the proposed multiplier increases as the word length increases.
Conclusions
In this work, the design of reconfigurable complex constant multipliers was considered, with a focus on rotators in fast Fourier transforms. A multiplier for 32-point resolution was introduced. In addition, a slightly modified version of a previously proposed multiplier for 16-point resolution was discussed. For these two multipliers, finite word length properties for both data and coefficient quantization were discussed and optimal coefficients were derived. For completeness, the optimal coefficients […]

Table header: Fractional bits | Proposed (Fig. 4): Adders, MUXs | Red. Booth [11]: Adders, MUXs | MCM [10]: Adders, MUXs
