This paper studies the effect of signal round-off errors on the accuracy of the multiplier-less Fast Fourier Transform-like transformation (ML-FFT). The idea of the ML-FFT is to parameterize the twiddle factors in the conventional FFT algorithm as certain rotation-like matrices and to approximate the parameters of these matrices by sum-of-powers-of-two (SOPOT) or canonical signed digit (CSD) representations. The error due to the SOPOT approximation is called the coefficient round-off error. In addition, signal round-off error occurs because of insufficient wordlengths. Using a recursive noise model of these errors, the minimum hardware needed to realize the ML-FFT subject to a prescribed output bit accuracy can be obtained with a random search algorithm. A design example is given to demonstrate the effectiveness of the proposed approach.
I. INTRODUCTION
The Discrete Fourier Transform (DFT) is an important tool in digital signal processing [1]. A wealth of fast algorithms, such as the Cooley-Tukey Fast Fourier Transform (FFT) and the prime factor algorithm (PFA) FFT, are available to compute DFTs of different lengths efficiently. Recently, efficient realizations of the multiplier-less FFT based on the integer [2, 3] or SOPOT representation [4], and their extension to multiplier-less sinusoidal transforms [5], have been proposed. The main objective is to avoid expensive general-purpose multipliers by replacing them with a limited number of shifters and adders. However, this approximation unavoidably introduces errors, which are referred to as coefficient round-off errors. Fortunately, as proposed in [4], tradeoffs between arithmetic complexity and output accuracy can be made so that the minimum arithmetic complexity can be obtained for different applications, which require different degrees of error tolerance. Due to the finite wordlength of the internal representation, another source of error, called signal round-off error [1], occurs when the intermediate data are rounded after complex multiplication with the twiddle factor. Moreover, overflow can occur due to insufficient internal wordlength when fixed-point arithmetic is used. Unfortunately, most design methods for the multiplier-less FFT focus only on the effect of the coefficient round-off errors. In order to satisfy the prescribed output accuracy, one usually employs a fixed but rather long wordlength for all intermediate data, which increases the hardware complexity. Therefore, a general model is needed to determine the minimum hardware complexity subject to a given output accuracy.
In this paper, we propose a new recursive round-off noise model for computing the output bit accuracies of the ML-FFT under finite wordlength effects. Without loss of generality, the decimation-in-time (DIT) radix-p ML-FFT is used as an example. The noise sources due to the rounding operations performed after multiplications are first identified at each stage, based on the structure of the DIT radix-p FFT, where the transform size N is an integer power of p. For each output point at any stage, the noise power is determined statistically from its associated noise sources, using the commonly used uncorrelated white noise model. Together with the noise powers coming from the previous stage, the total noise powers of all the output points can be calculated and propagated to the next stage.
Eventually, the final output bit accuracy of each output point can be obtained by summing the total noise powers accumulated at this output point. Using these results, the internal wordlength of each intermediate data item can then be optimized subject to a prescribed output accuracy using a random search algorithm [8, 9]. As an illustration, the number of adder cells and registers used, which is related to the exact wordlength used for each intermediate data item, is chosen as a measure of the hardware complexity. The design results show that the proposed approach can efficiently determine the minimum hardware complexity subject to a prescribed output bit accuracy. The rest of this paper is organized as follows: Section II describes the ML-FFT algorithm based on the DIT radix-p FFT. Section III is devoted to the error analysis of the ML-FFT and the wordlength determination method. A design example demonstrating the effectiveness of the proposed approach is given in Section IV. Finally, conclusions are drawn in Section V.
II. THE ML-FFT ALGORITHM
A. The decimation-in-time (DIT) radix-p FFT
The discrete Fourier transform (DFT) of an N-point sequence {x(n)} is given by: $X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}$, $k = 0, 1, \ldots, N-1$, where $W_N = e^{-i 2\pi / N}$.
; r is the range of the coefficients and t is the number of terms used in each coefficient. Using these results, the number of SOPOT terms can then be optimized using the random search algorithm [7] subject to the specified errors between the candidate transform and its ideal counterpart. These errors due to the SOPOT approximation are called coefficient round-off errors, and they can be reduced by using more SOPOT terms. Interested readers are referred to [4] for more details. In the next section, we present the analysis of another noise source, called signal round-off error, which also affects the output accuracy of the ML-FFT.
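To make the roles of r and t concrete, the following is a minimal sketch of a SOPOT approximation. It assumes the form $\alpha \approx \sum_{k=1}^{t} a_k 2^{b_k}$ with $a_k \in \{-1, +1\}$ and $b_k \in \{-r, \ldots, r\}$, and it uses a simple greedy term selection as a stand-in; it is not the random search procedure referred to above. The function names `sopot_approx` and `sopot_value` are hypothetical.

```python
import math


def sopot_approx(value, t, r):
    """Greedy SOPOT sketch (an assumption, not the paper's search method).

    Returns a list of (sign, exponent) pairs with at most t terms and
    exponents limited to the range [-r, r].
    """
    terms = []
    residual = value
    for _ in range(t):
        if residual == 0.0:
            break
        sign = 1 if residual > 0 else -1
        mag = abs(residual)
        # Candidate exponents: the powers of two just below and just above
        # the residual magnitude, clipped to the allowed range.
        e = math.log2(mag)
        lo = max(-r, min(r, math.floor(e)))
        hi = max(-r, min(r, math.ceil(e)))
        b = lo if abs(mag - 2.0 ** lo) <= abs(mag - 2.0 ** hi) else hi
        # Stop when another term would no longer reduce the error.
        if abs(mag - 2.0 ** b) >= mag:
            break
        terms.append((sign, b))
        residual -= sign * 2.0 ** b
    return terms


def sopot_value(terms):
    """Evaluate a list of (sign, exponent) SOPOT terms."""
    return sum(sign * 2.0 ** b for sign, b in terms)


if __name__ == "__main__":
    target = math.sin(math.pi / 8)
    for t in (1, 2, 3):
        terms = sopot_approx(target, t, r=8)
        print(t, terms, abs(sopot_value(terms) - target))
```

Each additional term can only reduce the approximation error in this sketch, which mirrors the statement that the coefficient round-off error is reduced by using more SOPOT terms.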
III. ROUND-OFF ANALYSIS OF THE ML-FFT
A. simplicity. These errors exist only in radix-8 or higher-radix FFT algorithms. In the radix-2 and radix-4 FFT algorithms, the twiddle factors $\hat{W}_p^{jr}$ are $\pm 1$ or $\pm i$ only, so the DFT does not require any multiplications and there is no rounding error in their implementation. In the rest of this section, p is assumed to be equal to 2 or 4. However, the analysis can easily be generalized to higher radices or split-radix FFT algorithms. Next we will discuss the determination of ) (
The round-off noise introduced by the rotation-like matrix R in (2-4) can be computed as shown in Figure 3. If rounding is performed after each multiplication, three additive noise sources are introduced, as shown in the figure. Let can be written as follows:
Equations (3-2) and (3-3) also apply to the ML-FFT, with $\tan(\theta/2)$ and $\sin\theta$ replaced by their SOPOT approximations.
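The three noise sources can be illustrated with a minimal sketch. It assumes that the rotation-like matrix is realized as three lifting (shear) steps with multipliers $\tan(\theta/2)$, $\sin\theta$ and $\tan(\theta/2)$, which reproduces a plane rotation by $\theta$; the sign convention, the round-to-nearest quantizer and the names `quantize`, `lifting_rotate` and `rounding_noise_power` are assumptions made for illustration. The noise power $2^{-2b}/12$ is the usual uniform white-noise value for rounding to b fractional bits.

```python
import math


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits."""
    step = 2.0 ** -frac_bits
    return math.floor(x / step + 0.5) * step


def rounding_noise_power(frac_bits):
    """Variance of one rounding error under the uniform white-noise model."""
    return 2.0 ** (-2 * frac_bits) / 12.0


def lifting_rotate(x, y, theta, frac_bits=None):
    """Rotate (x, y) by theta with three lifting steps.

    If frac_bits is given, each product is rounded, so exactly three
    rounding errors (noise sources) enter the computation.
    """
    t = math.tan(theta / 2.0)
    s = math.sin(theta)
    if frac_bits is None:
        q = lambda v: v
    else:
        q = lambda v: quantize(v, frac_bits)
    x1 = x - q(t * y)    # first shear: noise source 1
    y1 = y + q(s * x1)   # second shear: noise source 2
    x2 = x1 - q(t * y1)  # third shear: noise source 3
    return x2, y1


if __name__ == "__main__":
    theta = math.pi / 8
    x, y = 0.5, -0.25
    exact = (x * math.cos(theta) - y * math.sin(theta),
             x * math.sin(theta) + y * math.cos(theta))
    print(exact, lifting_rotate(x, y, theta, frac_bits=13))
```

In an ML-FFT setting, `t` and `s` would be replaced by their SOPOT approximations, so each product becomes a few shifts and adds followed by one rounding.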
By interchanging the summation signs in (2-2), we can see that the real and imaginary parts of the output signal increase by no more than a factor of p from stage to stage, assuming that both the real and imaginary parts of the output signal are less than one. Therefore, to avoid overflow of the intermediate data, the outputs of the p-point DFT are usually scaled by a factor of 1/p. However, another noise source ) ( 
, (3-6) assuming that the round-off noises are uncorrelated. For other radices, the DFT might introduce additional noise sources, which depend on the exact implementation. It is also worth mentioning that a further scaling factor of N is needed at the final output in order to obtain the correct DFT values, but this is not taken into account in our noise model. ) ( Further, if we assume ) ( Ŵ , and the noise powers at the previous stage. Note that the noise powers accumulate and eventually propagate to the final stage. In order to satisfy the required output accuracy, the noise power at each output should be reduced by increasing the internal wordlengths for the fractional bits at different stages of the FFT structure.
B. Overflow handling
Signal overflow occurs when the wordlength allocated to the integer bits is insufficient to handle the growth in the integer part of the output signal after additions. More bits should be allocated to the integer part of the adder output, and of the register holding it, so as to avoid signal overflow. There are two approaches to deal with this situation: the number of bits in the fractional part can either be retained or decreased, depending on the required output accuracy. Obviously, the latter introduces additional round-off noise. To determine whether overflow will occur at a particular adder, a conservative measure is used: the addition is evaluated with the signs of all the signals ignored. Therefore, the worst-case wordlength format at the adder output can always be found. This ensures that no overflow will occur at any adder output, at the expense of slightly increased hardware complexity. In the next subsection, we describe the approach used to determine the internal wordlengths for a prescribed output accuracy.
C. Wordlength determination
For a given output accuracy, random search is used to minimize the hardware complexity [8, 9] using the proposed noise model of the FFT algorithm. To start with, an objective function describing the hardware complexity has to be set up. As an illustration, the number of adder cells and/or registers is employed as a measure of the hardware complexity, since these are the major resources in a hardware implementation. This measure is also related to the internal wordlengths of the intermediate signals, which are the variables that we want to optimize. Note that other meaningful measures can be used instead. In general, the determination of the internal wordlengths can be done in three steps. First, the SOPOT approximation of the twiddle factors is found as discussed in Section II-B. Secondly, the format of the maximum wordlengths for the intermediate signals, including both real and imaginary parts, is obtained by assuming that all the inputs take their maximum values and that the precision of all the signals is retained. Finally, using Sections III-A and III-B, the noise powers introduced by the rounding operations, as well as the output bit accuracies, can be statistically calculated for the proposed wordlengths found by the random search algorithm. These proposed wordlengths are stored in the vector f, which is optimized together with the other vectors storing all the intermediate signal formats. The candidate with the minimum number of adder cells that satisfies the prescribed output accuracy is declared the solution of the problem. More precisely, we can formulate the problem as follows: $\min\, C(f, \cdot)$ subject to $P_{total}(f, \cdot) \le P_{spec}$, (3-10) where $P_{total}$ and $P_{spec}$ are respectively the total noise power and the specified output accuracy at the DFT output, and $C(\cdot)$ is the objective function in this problem. There are two ways to speed up the search process. 
The first is to identify the symmetries stage by stage: because all the input signals are assumed to take their maximum values, the number of variables can be greatly reduced. For instance, to calculate the wordlengths for all the intermediate signals within a radix-2 64-point FFT, only the first two output points at the first stage need to be examined, rather than all 64 output points. The second is to make a reasonable initial guess in order to shorten the search time. For example, higher noise power can be allowed at the earlier stages, so that shorter wordlengths are allocated there while the overall output bit accuracies at the final stage still satisfy the requirement.
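The search described in this subsection can be sketched as follows. Everything in this block is a simplified stand-in chosen for illustration: the per-stage noise recursion (rounding noise $2^{-2b}/12$ added at each stage, with noise from the previous stage attenuated by 1/p), the cost function (the sum of the fractional wordlengths as a proxy for adder cells), and the names `total_noise_power`, `cost` and `random_search` are assumptions, not the noise model or objective of this paper.

```python
import random


def total_noise_power(frac_bits, p=2):
    """Toy recursive noise model (assumed, one wordlength per stage).

    Each stage adds one rounding noise source of power 2**(-2b)/12, and
    noise carried over from the previous stage is attenuated by 1/p.
    """
    power = 0.0
    for b in frac_bits:
        power = power / p + 2.0 ** (-2 * b) / 12.0
    return power


def cost(frac_bits):
    """Toy hardware cost: total number of fractional bits."""
    return sum(frac_bits)


def random_search(num_stages, spec_power, b_min=8, b_max=24,
                  iters=3000, seed=0):
    """Minimize cost(f) subject to total_noise_power(f) <= spec_power."""
    rng = random.Random(seed)
    best = [b_max] * num_stages  # conservative, feasible starting point
    if total_noise_power(best) > spec_power:
        raise ValueError("specification cannot be met within b_max bits")
    best_cost = cost(best)
    for _ in range(iters):
        # Randomly perturb every wordlength by -1, 0 or +1 bit.
        cand = [max(b_min, min(b_max, b + rng.choice((-1, 0, 1))))
                for b in best]
        # Keep the candidate only if it is feasible and strictly cheaper.
        if total_noise_power(cand) <= spec_power and cost(cand) < best_cost:
            best, best_cost = cand, cost(cand)
    return best, best_cost


if __name__ == "__main__":
    f, c = random_search(num_stages=6, spec_power=2.0 ** -32)
    print(f, c, total_noise_power(f))
```

Starting from the longest wordlengths keeps every accepted candidate feasible; a better initial guess, such as shorter wordlengths at the earlier stages, would shorten the search in the same way as described above.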
IV. DESIGN EXAMPLE
This example shows the effectiveness of the round-off noise model proposed in Section III. To start with, consider the 64-point radix-2 (i.e. p = 2) FFT with the prescribed accuracy at each output equal to 16 bits. As mentioned earlier, the 2-point DFT can be implemented without any multiplications, and its outputs at each stage are scaled by a factor of 1/2. The input sequence {x(n)}, n = 0,1,…,63, consists of complex values with both real and imaginary parts having the format <1/13>, i.e. 14 bits with a maximum value of 0.99988. After the wordlength optimization discussed in Section III, the entire FFT structure requires 36070 adder cells. Tables 1 and 2 show a summary of the results and the internal wordlength formats at the 6th stage of the ML-FFT. The output formats at the remaining stages are omitted due to the page limitation. To give an idea of the hardware savings of the proposed structure, a comparison with a structure using a fixed wordlength is considered below. For the sake of comparison, the internal wordlengths of all intermediate signals are fixed to 19 bits. The corresponding number of adder cells is 36936, which is slightly higher than that of the proposed structure. For simulation purposes, we use Matlab to model the hardware implementation of the ML-FFT and assume that both structures are free from coefficient round-off errors. Figure 4 shows the output bit accuracies of the radix-2 64-point ML-FFT using the proposed wordlengths (solid line) and the fixed wordlength (dotted line), averaged over 10000 randomly generated binary data with the format <1/13>. The results show that the proposed structure meets the required bit accuracy quite well, with only slight deviation from the 16-bit accuracy. They also show that the output accuracies of the proposed structure are in general higher than those of the structure using the fixed wordlength, except for those at output points 0, 15, 31 and 47. 
The reason is simply that there is no non-trivial multiplication along the data paths associated with those 4 output points from the first stage to the final stage. On the other hand, Figure 4 also reveals that the more multiplications with the twiddle factors there are on the data path of an output point, the lower its bit accuracy, i.e. the higher its round-off error. As expected, the proposed approach can efficiently control the round-off errors by adjusting the internal wordlengths so that the prescribed output bit accuracies are satisfied without any overflow.
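The figures quoted in this example can be checked with a short calculation. The sketch below assumes that the format <1/13> denotes a two's-complement number with 1 integer (sign) bit and 13 fractional bits, which is consistent with the stated 14-bit width and maximum value; the helper names `fixed_point_max` and `relative_saving` are hypothetical.

```python
def fixed_point_max(int_bits, frac_bits):
    """Largest two's-complement value with int_bits integer bits
    (sign bit included) and frac_bits fractional bits."""
    return 2.0 ** (int_bits - 1) - 2.0 ** -frac_bits


def relative_saving(fixed_cells, proposed_cells):
    """Fraction of adder cells saved relative to the fixed-wordlength design."""
    return (fixed_cells - proposed_cells) / fixed_cells


if __name__ == "__main__":
    # Format <1/13>: 1 + 13 = 14 bits in total.
    print(round(fixed_point_max(1, 13), 5))
    # Adder cells: 36936 (fixed 19-bit) versus 36070 (proposed).
    print(relative_saving(36936, 36070))
```

Under this reading, the maximum input value is $1 - 2^{-13}$, which rounds to 0.99988, and the proposed wordlengths save a little over 2% of the adder cells compared with the fixed 19-bit design.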
V. CONCLUSION
An error analysis of the ML-FFT, using the DIT radix-p FFT as an example with N being an integer power of p, is presented. The ML-FFT parameterizes the twiddle factors in the conventional radix-p FFT algorithm as certain rotation-like matrices and approximates the associated parameters by sum-of-powers-of-two (SOPOT) or canonical signed digit (CSD) representations. Apart from the error due to the SOPOT approximation, there is another error, called signal round-off error, which also affects the output bit accuracy of the ML-FFT. A recursive noise model is developed to model the statistical properties of these errors. Using this model, a random search algorithm is proposed to efficiently determine the minimum hardware complexity needed to realize the ML-FFT subject to a prescribed output bit accuracy. Simulation results show good agreement with the theoretical results.
Table 2: Proposed wordlengths of the output formats at the 6th stage for the 64-point radix-2 ML-FFT (O/P = Output Point, Real. = Real Storage and Imag. = Imaginary Storage).
