Abstract: A general mixed-radix FFT design based on the in-place strategy is derived, and a low-complexity scheme for efficiently implementing mixed-radix FFTs is proposed. In this method, we develop an accumulator that generates the addresses of both the operands and the twiddle factors in a simple and practical way. This approach extends the range of supported FFT sizes and reduces the hardware complexity of non-power-of-two memory-based FFTs. Finally, the 3780-point FFT is taken as an example to validate the proposed method.
Introduction
It has always been recognized as very important to digital signal processing applications that one have available a general mixed-radix FFT program which is, at the same time, operationally efficient [1]. Since the radix-2 FFT algorithm can be made to yield a very efficient and simple program, it was, for a long time, considered desirable to modify one's problem to fit the program and use a transform length N equal to a power of 2, which restricts its application.
With the development of the orthogonal frequency-division multiplexing technique in communication systems including DMB-T [2], LTE SC-FDMA [3], 3GPP LTE [4], etc., the mixed-radix FFT has become practical and useful and has attracted increasing attention [5, 6, 7, 8, 9, 10].
For the different applications of FFT processors, two key architectures have been proposed: pipeline and memory-based [11, 12, 13, 14]. The pipeline architecture offers speed advantages but requires more hardware resources. The memory-based architecture is slower, but it requires the least amount of hardware resources and is often implemented with the in-place strategy, which overwrites the previous block of data [14]. For generalized mixed-radix FFT designs, the memory-based architecture is more suitable, especially for long DFT sizes. Here, the in-place approach is adopted to design mixed-radix FFTs.
However, circuits for the in-place strategy are complex to design. Traditional methods, such as [7, 9], usually require complicated circuits. In [9, 10], data access for the twiddle factors is not discussed. The scheme described here can be applied to general mixed-radix FFTs, i.e. radix-$r_1/r_2/\cdots/r_t$ FFTs, where $r_i$ is a radix value. With this method, an arbitrary FFT size can be chosen and low-complexity address generation is obtained.
The rest of this paper is organized as follows. In Section 2, the existing method of generating data access addresses is described and its limitations are analyzed. The proposed method of address generation is presented in Section 3. Section 4 shows the 3780-point FFT design as an illustrative example. Finally, concluding remarks are provided in Section 5.
Existing method for mixed radix algorithm
The definition of a DFT is $X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}$. For the size $N = N_1 N_2$, a method to compute its DFT is as follows:
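Eq. (1) is not reproduced in this text. A form consistent with the coefficients $A_2$ and $B_1$ and with the expressions $(N_2 n_1 + n_2)$ and $(k_1 + N_1 k_2)$ used in the discussion would be the usual two-factor index map. It is given here as an assumption, not as the original equation, with $\langle \cdot \rangle_N$ denoting reduction modulo $N$:

```latex
\begin{aligned}
n &= \langle N_2 n_1 + A_2 n_2 \rangle_N, \\
k &= \langle B_1 k_1 + N_1 k_2 \rangle_N, \\
&\quad n_1, k_1 \in [0, N_1 - 1], \qquad n_2, k_2 \in [0, N_2 - 1].
\end{aligned}
```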
Eq. (1) shows the relationship between the indexes $n$ and $k$ and the index vectors $(n_1, n_2)$ and $(k_1, k_2)$, transforming the one-dimensional range $[0, N-1]$ into the two-dimensional range $[0, N_1-1] \times [0, N_2-1]$. The selection of the coefficients $A_2$ and $B_1$ depends on the relation between $N_1$ and $N_2$ [9]. If $N_1$ and $N_2$ are relatively prime, the prime factor FFT can be employed. Although it eliminates the twiddle factor multiplications, the prime factor FFT makes access address generation complex. The indexes $n$ and $k$ are obtained by modulo operations, and as the number of points grows, more modulo operations are needed in the address generation circuits, which must calculate the correct address for each input/output of each butterfly in each column. If $N_1$ and $N_2$ are not relatively prime, let $A_2 = B_1 = 1$; then $(N_2 n_1 + n_2)$ and $(k_1 + N_1 k_2)$ are both less than $N$, so no modulo computation is required.
In the following proposed design, regardless of the relationship between the factors, the mixed-radix FFT algorithm is expressed by an iterative equation based on the Cooley-Tukey algorithm. The expression can be applied to either decimation-in-time or decimation-in-frequency FFTs with ordered inputs. The proposed method requires fewer modulo operations than the existing methods and can be applied to implement general mixed-radix FFTs efficiently.
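The case $A_2 = B_1 = 1$ can be checked numerically. The following is a minimal sketch, not the paper's hardware design: it assumes the conventional twiddle factor $W_M = e^{-j 2\pi / M}$, and the function names `w`, `direct_dft` and `two_factor_dft` are illustrative only.

```python
import cmath


def w(exponent, modulus):
    """Twiddle factor W_modulus^exponent, assuming W_M = exp(-2j*pi/M)."""
    return cmath.exp(-2j * cmath.pi * exponent / modulus)


def direct_dft(x):
    """Reference O(N^2) DFT: X(k) = sum_n x(n) W_N^(nk)."""
    n_points = len(x)
    return [sum(x[n] * w(n * k, n_points) for n in range(n_points))
            for k in range(n_points)]


def two_factor_dft(x, n1_size, n2_size):
    """DFT of length N1*N2 using n = N2*n1 + n2 and k = k1 + N1*k2."""
    n_points = n1_size * n2_size
    if len(x) != n_points:
        raise ValueError("len(x) must equal N1 * N2")
    out = [0j] * n_points
    for k1 in range(n1_size):
        # Length-N1 DFTs over n1, each scaled by the twiddle W_N^(n2*k1).
        column = [
            w(n2 * k1, n_points)
            * sum(x[n2_size * n1 + n2] * w(n1 * k1, n1_size)
                  for n1 in range(n1_size))
            for n2 in range(n2_size)
        ]
        # Length-N2 DFTs over n2; the output index needs no modulo reduction.
        for k2 in range(n2_size):
            out[k1 + n1_size * k2] = sum(
                column[n2] * w(n2 * k2, n2_size) for n2 in range(n2_size)
            )
    return out
```

Both index expressions stay inside $[0, N-1]$ by construction, which is why the sketch contains no `%` operator even when the two factors share a common divisor, e.g. `two_factor_dft(x, 4, 6)` for a 24-point input.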
Proposed method for address generation
Let $N = \prod_{i=1}^{t} r_i^{s_i}$ denote the number of points of a general mixed-radix FFT, where $r_i$ and $s_i$ are a radix value and the corresponding integer power for the $i$-th stage of the FFT computation. We assume the computation order $r_1, r_2, r_3, \ldots, r_t$ and define the parameter $s = \sum_{i=1}^{t} s_i$ as the total number of FFT stages. First, an accumulator (ACC) is introduced; it can be mapped to the addresses of the operands as well as to those of the corresponding twiddle factors.
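As a small numerical illustration of these two definitions, the sketch below evaluates $N$ and $s$ for a 60-point radix-2/3/5 FFT. The powers $(2, 1, 1)$ are simply the prime factorization of 60, and the function name is hypothetical.

```python
from math import prod


def fft_size_and_stages(radices, powers):
    """Return (N, s) with N = prod(r_i ** s_i) and s = sum(s_i)."""
    if len(radices) != len(powers):
        raise ValueError("radices and powers must have the same length")
    n_points = prod(r ** p for r, p in zip(radices, powers))
    return n_points, sum(powers)


print(fft_size_and_stages([2, 3, 5], [2, 1, 1]))  # (60, 4)
```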
ACC design
Suppose that $ACC = (C_{s-1} C_{s-2} \ldots C_2 C_1 C_0)|_{MR}$, an expression consisting of $s$ digits, where $C_{s-1}$ is the most significant digit and $C_0$ is the least significant digit. The value of each digit of ACC is analyzed below. The notation $Y = (X)|_{MR}$ means that $Y$ is expressed by $X$ in mixed-radix form.
If the architecture of address mapping is consistent for every stage, the hardware complexity is low, and we only need to design ACC so that the access address is obtained by shifting the order of its digits. Therefore, ACC is designed for every FFT stage as follows.
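The mixed-radix notation can be made concrete with a short conversion routine. This is a generic sketch of mixed-radix positional notation, not code from the paper; the helper names are hypothetical, and digits and radices are both listed most significant first.

```python
def to_mixed_radix(value, radices):
    """Split value into digits, one per radix (most significant first)."""
    digits = []
    for radix in reversed(radices):
        value, digit = divmod(value, radix)
        digits.append(digit)
    if value:
        raise ValueError("value does not fit in the given radices")
    return digits[::-1]


def from_mixed_radix(digits, radices):
    """Inverse of to_mixed_radix: recover the decimal value."""
    value = 0
    for digit, radix in zip(digits, radices):
        if not 0 <= digit < radix:
            raise ValueError("digit out of range for its radix")
        value = value * radix + digit
    return value


print(to_mixed_radix(37, [2, 2, 3, 5]))              # [1, 0, 1, 2]
print(from_mixed_radix([1, 0, 1, 2], [2, 2, 3, 5]))  # 37
```

With the radices $(2, 2, 3, 5)$ of a 60-point FFT, the digit weights are 30, 15, 5 and 1, so $(1\,0\,1\,2)|_{MR} = 30 + 5 + 2 = 37$.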
When $k \in [1, s_1]$, the radix is $r_k = r_1$. For the current stage $k$, the least significant digit $C_0$ corresponds to the current-stage radix, and the $(s-1)$ most significant digits follow the assumed order with the current-stage radix excluded. As Fig. 1 shows, the structure is the same for every stage, i.e. the mapping structure is kept consistent. Because the access address is expressed in mixed radix, its decimal representation can be obtained by number system conversion. Because ACC advances with every clock cycle, the $r_k$ operands of the current stage are obtained in $r_k$ consecutive clock cycles.
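One way to read this description is as a mixed-radix counter whose digit radices are reordered per stage. The sketch below follows that reading under stated assumptions: `radices` is the expanded per-stage list (for example `[2, 2, 3, 5]` for 60 points), stages are numbered from 1, and the function names are hypothetical.

```python
from itertools import product


def acc_radices(radices, stage):
    """Digit radices of ACC, most significant first, for a 1-indexed stage.

    The current-stage radix is moved to the least significant position;
    the remaining radices keep their assumed order.
    """
    rest = radices[:stage - 1] + radices[stage:]
    return rest + [radices[stage - 1]]


def acc_sequence(radices, stage):
    """Yield ACC digit tuples in counting order, one per clock cycle."""
    return product(*(range(r) for r in acc_radices(radices, stage)))


seq = list(acc_sequence([2, 2, 3, 5], stage=3))
print(acc_radices([2, 2, 3, 5], 3))  # [2, 2, 5, 3]
print(seq[:4])  # [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 2), (0, 0, 1, 0)]
```

In this reading, the first three tuples differ only in $C_0$, so they correspond to the three operands of one radix-3 butterfly fetched in three consecutive cycles.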
Address mapping for operands
Access addresses for operands can be mapped directly from ACC. The relationship between ACC and $\mathrm{Addr}_{op}$ is $\mathrm{Addr}_{op}(k) = (\underbrace{r_1 \cdots r_1}_{s_1} \underbrace{r_2 \cdots r_2}_{s_2} \cdots \underbrace{r_t \cdots r_t}_{s_t})|_{MR}$, $k \in [1, s]$.
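A minimal sketch of this mapping, under the assumption that "shifting the digit order" means moving $C_0$ back to the position of the current-stage radix before converting to decimal. The function names and the 60-point radix list `[2, 2, 3, 5]` are illustrative choices, not taken from the paper's figures.

```python
from itertools import product


def operand_address(acc_digits, radices, stage):
    """Map one ACC digit tuple (most significant first) to a memory address."""
    digits = list(acc_digits)
    c0 = digits.pop()                # C_0 carries the current-stage radix
    digits.insert(stage - 1, c0)     # restore the order r_1 ... r_t
    address = 0
    for digit, radix in zip(digits, radices):
        address = address * radix + digit   # mixed radix to decimal
    return address


def stage_addresses(radices, stage):
    """Operand addresses in the order ACC produces them for one stage."""
    order = radices[:stage - 1] + radices[stage:] + [radices[stage - 1]]
    return [operand_address(acc, radices, stage)
            for acc in product(*(range(r) for r in order))]


print(stage_addresses([2, 2, 3, 5], stage=3)[:3])  # [0, 5, 10]
```

Only digit reordering and a fixed-weight conversion appear in the mapping; every address in $[0, 59]$ is visited exactly once per stage.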
Address mapping for twiddle factors
Two parameters, $ACC_{reverse}$ and $\beta$, are introduced to compute the twiddle factor address. $ACC_{reverse}$ is the reverse-order expression of ACC. $\beta$, which consists of $(s-1)$ digits, is obtained from $ACC_{reverse}$.
For stage 1, because the required twiddle factors are all 1, the exponents are 0, so $\beta$ is set to all zeros.
For stage $k$ ($k \ge 2$), there are three aspects to be considered:
(1) The $(k-1)$ least significant digits of $ACC_{reverse}$ are fetched as the most significant digits of $\beta$;
(2) The other $(s-k)$ digits of $\beta$ are padded with zeros;
(3) The most significant digit of $ACC_{reverse}$, i.e. $C_0$, is not considered.
Hence, $\beta$ is expressed for every stage as follows. According to $\beta$, the access addresses of the twiddle factors for the radix-$r_k$ ($1 \le k \le t$) butterfly computation are obtained by $\mathrm{Addr}_{twi}(k) = (n_k - 1) \times \beta(k)$, $n_k \in [0, r_k - 1]$. Because the addresses of the twiddle factors do not depend on $C_0$, while the $r_k$ operands are obtained in $r_k$ consecutive clock cycles, the $r_k$ twiddle factors are fetched in one clock cycle.
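The three rules for $\beta$ can be sketched at the digit level. This is an illustration under assumptions: ACC digits are given most significant first, the fetched digits keep their relative order, and the function name is hypothetical. The radices attached to the digits of $\beta$ are not stated in the text, so the sketch stops at the digit list and does not convert it to decimal.

```python
def beta_digits(acc_digits, stage):
    """Digits of beta (most significant first) for a 1-indexed stage.

    acc_digits is (C_{s-1}, ..., C_1, C_0).
    """
    s = len(acc_digits)
    acc_reverse = list(reversed(acc_digits))   # (C_0, C_1, ..., C_{s-1})
    # Rule (1): the (stage - 1) least significant digits of ACC_reverse.
    # Rule (3): C_0, at index 0, is never reached because stage - 1 <= s - 1.
    fetched = acc_reverse[s - (stage - 1):]
    # Rule (2): pad the remaining (s - stage) positions with zeros.
    return fetched + [0] * (s - stage)


acc = [1, 0, 2, 1]            # C_3 = 1, C_2 = 0, C_1 = 2, C_0 = 1
print(beta_digits(acc, 1))    # [0, 0, 0]
print(beta_digits(acc, 3))    # [0, 1, 0]
print(beta_digits(acc, 4))    # [2, 0, 1]
```

For stage 1 the fetched slice is empty, which reproduces the all-zero $\beta$ stated for the first stage.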
Finally, the relationship between the input and output orders is discussed and it is expressed as follows:
If $N = 2 \times 4^s$, the relationship between the input order and the output order simplifies to the one in [8]. From the above discussion, the key point of the proposed scheme is the design of ACC, which can be directly mapped to the access addresses of the operands as well as of the twiddle factors. Modulo operations are required only in designing ACC, not in the mapping that generates the access addresses.
Since the 3780-point FFT is an important computation in the DMB-T modulation system, we take it as an illustrative example. 3780 can be decomposed into 4
Therefore, the design for each stage is shown in Table I.
Taking stage 4 as an example, the generation of each parameter is shown in Fig. 2.
$n$ denotes the address of the output of stage 3 in the memory, which ranges from 0 to 3779. $n'_1$ and $n''$ are the decimal representations of $\mathrm{Addr}_{op}$ and $\beta$, respectively. $n'_2$ denotes the write address, which is the same as the read address because of the in-place strategy.
Comparisons and analysis
Because the twiddle factors are not considered in [9] and the intermediate values in [7] are hard to implement in hardware, no comparison is made for the address generation of the twiddle factors.
A comparison of the address generation for the operands is given, taking the 60-point radix-2/3/5 FFT as an example. Fig. 3(a) shows the proposed scheme and Fig. 3(b) shows the scheme in [9]. The proposed method has two main advantages: (a) The FFT architecture is consistent for every stage, so only one architecture needs to be designed to obtain the access address. In the method of [9], the architecture of address generation varies from stage to stage, so a different architecture has to be designed for every stage to obtain the corresponding access address. (b) No complex modulo operations are required. In [9], by contrast, the larger the FFT size is, the more resources and modulo operations are needed. Table II lists the number of mathematical operations. The proposed scheme therefore reduces the complexity of address generation.
Conclusion
The method discussed in this paper is aimed at implementation on hardware platforms, especially FPGAs. The mixed-radix FFT can also be applied in remote sensing, for example in synthetic aperture radar. For high performance and fast computation, an FPGA rather than a DSP is usually used to implement parts of the related algorithms, such as digital pulse compression. The proposed method provides a low-complexity design for general mixed-radix FFTs. For FFT processors, data access is the key point, and the strategy in this paper makes the generation of access addresses simple. The address generation algorithm reduces the hardware complexity and improves the implementation efficiency at the same time.

Table II. Number of mathematical operations.
Operation                 Proposed method   Method in [9]
2-input addition          2                 6
2-input multiplication    2                 9
Modulo operation          0                 5
