We hereby introduce an extension of the CooleyTuckey algorithm (CT) 
Introduction
This article deals with a way of computing D I T when several levels of memory are available. This is usually the case for digital systems. Besides the main memory, general purpose processors may have registers, cache memory (on and off chip) : DSP may have internal memory ; ASIC may have whatever is necessary. We will consider two levels : local and main memory.
A wide class of Fourier Transform algorithms (Cooky-Tuckey, Prime Factor Algorithm -PFA - [Bur88] ) use a factorisation of N to change the unidimensional DFT into a multi-dimensional DFT in which the number of dimensions is equal to the number of factors and the sizes of the dimensions are equal to the factors.
The primitive to be applied on each dimension is called a radix-M. It is the computation of a length-M DIT followed by multiplication by a vector of twiddle factors. In some cases (PFA) the twiddle factors are all ones and the multiplication is not needed.
The standard processing way of these algorithms is to handle successively all the dimensions in computation stages. In this way, all the data are read from and writtcn to the main memory once per factor of N.
For example, the figure 1 shows a timc/address graph of a 64 points DFT computed by a radix-2 CooleyTuckey algorithm. There are six stages and in each one the whole memory is read and written once. Between two accesses to an element, nearly all the others are read. This memory-reference pattern prevents an efficient use of the local memory.
Using larger radix slightly reduces computational effort and cuts down drastically external transfers. Larger sets of data can be put in the local memory and processed, leading to a more efficient use of local memory. As a cost, complexity of the algorithm increases to such an extent that usually only small factors are used (2 or 4 for CT, smaller than 20 for PFA).
In order to improve the locality of memory references without increasing the complexity, we propose to interleave the handling of the dimensions i.e. to merge two or more computation stages into a superstage and to reorganise computations in the latter.
The figure 2 shows a timehddress graph of a 64 points FFT computed by a modified radix-2 Cooky-Tuckey algorithm. There are two superstages in which the whole data are read and written. For each superstage, 8 groups of 8 points are successively handled. What we propose is to put these points in the local memory.
In the general case, it is possible to show thal, as the dimensions are independent, the amount of local memory necded for a superstage (its size) is the product of sizes of the dimensions merged in it. The number of transrers between h e local and the main memories is then equal to For an efficient use of local memory, the size of the superstages must be as homogenous as possible. The homogeneity is easy to obtain for C T all superstages may have the same size.
Derivation of the algorithm
The standard algorithm for CT radix 2 is the following:
where x is the data array the size of which is N m = log2 N . 2 n (J) = e' -TIIn this algorithm, the k loop is the loop on the dimensions. We will consider that the number of dimensions is factorizable ( m = ml m2). In order to interleave the handling of the dimensions, we must reorganise the loops so that the loop on k , or a part of it, is inside the loops on j and i , or part of them.
The property we use is that It is easy to check that in the k2 loop, there are m2 accesses to 2m2 elements of the x array. We have merged the m stages into mi superstages of size 2 m 2 . The algorithm can also be viewed as a CT Radix 2 m 2 , the Radix 2% being computed with a Radix 2 primitive. In comparaison with the CT radix 2 , the algorithm needs more sine and cosine computations or table accesses.
If the local memory is not a cache memory, it may be needed to explicitly transfer the data from the main memory to the local memory before the k2 loop and from the local memory to the main memory afterwards.
If m is not factorable, one superstage must be incomplete. We will have m = m l m2 + m 3 , with m3 < m2. The number of superstages (r) is given by r = I m2 I
The resulting algorithm may be found in figure 3.
-+-----.
_ -

Performance analysis
First model
To point out effects of the modifications, we use the following computational model (see figure 4) . A CPU is connected to a local and a main memory via a unique bus. The CPU may read its data from the local memory in a time tL or from the main memory in a time tM. A transfer from the main memory to the local memory or vice-versa lasts t T
Figure 4
With the standard approach, the time spent for the data transfers is :
" t~
In the modified algorithm, the time needed to access the data is only 2 N m t L But it is necessary to transfer the data between the main and local memory, which takes So, there is an advantage if or Therefore the advantage of our method increases as m;, increases. Thus in applications corresponding to our computation model, m2 must be as large as possible.
Supposing that tT = t L + tM and that m is given, [2] can be discussed as a t~ t M inequation parametrised by r . The value of m2 is determined in different ways according to the constrains of the problem. Sometimes, the maximum size of the local memory is fixed and it is then possible to compute the value of m2. so that the size of the superstage is lower than the maximum. Consumption reasons can impose the value of r , leading to deduce m2 from [ll If the inequation [2] holds, then using our method is advantageous. The value of m i and m3 are computed by
Second memory model
Better performances and better use of the computation unit are allowed by implementation of the the algorithm on the model of figure 6. where r is the number of total transfers. For a given computation time T, resources are best used when r being fixed, m2, mi and m3 can be computed as before.
+ f
Application
This model has been used to design an ASIC aimed at applying spectral transformations to an 8192 x 4096 pixels image. External memory transfer time is imposed at 200 ns and maximum computing time for the 8192 complex DFT is 4 ms. These requirements can be met with r=2, m2=7 and m3= 6 and tD has to be less than 40 ns. This can be achieved by two local memories of 128 complex words. The computing unit includes two fixed multipliers, one adder and one substractor. Using the symmetry of trigonometric functions, the middle factors are stored in two 1024 real words ROM.
Conclusions and future works
An extension of the CT Radix-2 has been described providing an original cache management strategy adding to the set of trade-offs between main and local memory accesses.
The importance of distributing computation in homogeneous superstages has been pointed out for local memory size and the notion may be easily applied to PFA.
Extension of the algorithm to higher basis of radix is immediate and offers the same savings in (local) memory accesses as for standard CT implementations. Decimation in time (DIT) versions are directly deduced.
The limited size of on-chip memory makes the algorithm suitable for using it in parallel architectures like the one described in [YC92] . Specialised Digital Signal Processors may benefit from using this algorithm as soon as their internal memory cannot contain the whole N points vectors by reducing accesses to the external memory.
The algorithm is of more interest than locality of reference only : figures 1 and 2 show that the first results are available sooner. Applications involving direct DFT followed by a point by point spectral correction (multiplication by a phasing vector) and a inverse DFT can exploit this phenomena to reduce the total achievement time. In order to do so, the DIF FFT is chained with the vector multiplication and an inverse DIT FFT which has the property of reordering final results. It is possible to start the computation of the inverse DFT before the direct one terminates.
Based on the algorithm in figure 5, a block floating point DSP processor dedicated to low-power application is being implemented for 16 bits real and imaginary input data's in 0,7pm ES2 3,3V technology. [Bo1941
Bibliography
