Abstract
Where V_dd is the supply voltage, f is the clock frequency, C_load is the load capacitance of the gate, and k is the switching activity factor, defined as the average number of times the gate makes an active transition in a single clock cycle. Therefore, to achieve low power in CMOS circuits, one must minimise one or more of the parameters V_dd, f, C_load and k. This paper primarily deals with low power architectures obtained by reducing the switching activity.
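The parameters above enter the standard expression for the dynamic power of a CMOS gate, P = k * C_load * V_dd^2 * f. The following minimal sketch evaluates it to show that, with voltage, frequency and load held constant, power scales linearly with the switching activity factor k. The function name and the numerical operating point (10 fF load, activity factors of 0.5 and 0.2) are hypothetical values chosen only for illustration.

```python
def dynamic_power(k, c_load, v_dd, f):
    """Average dynamic power in watts: P = k * C_load * V_dd^2 * f."""
    return k * c_load * v_dd ** 2 * f


# Hypothetical operating point: 10 fF load, 1.8 V supply, 100 MHz clock.
base = dynamic_power(k=0.5, c_load=10e-15, v_dd=1.8, f=100e6)
reduced = dynamic_power(k=0.2, c_load=10e-15, v_dd=1.8, f=100e6)

# Only k changed, so the power ratio equals the ratio of activity factors.
print(f"power ratio after reducing k: {reduced / base:.2f}")
```

Because the supply voltage enters quadratically, voltage scaling is the strongest lever in general; reducing k is the option that leaves the supply and clock untouched.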
A number of researchers have investigated the low power implementation of FFT processors. In [5], the author implemented a low power cache-memory architecture using an algorithm that offers good data locality over large portions of the computation. This architecture is better suited to longer transforms. In [6], [7], the authors proposed a low power architecture based on an algorithm that effectively minimises the number of complex multiplications.
In [8], [9], the authors investigated the realisation of low power FFT processors using asynchronous processing elements. Work on the application of order based processing to FFT coefficients is limited. The only reported work involves order based processing of coefficients and data at the inputs of the FFT computational units so as to minimise the overall switching activity of successive coefficients and data samples [10], [11]. This reduces the coefficient switching activity by just 19%, whereas the data activity increases by 1% for a 9-point FFT [11]. The authors of [10], [11] have not shown the actual power reduction obtained by their scheme in the presence of hardware overheads in the full FFT architecture.
This paper reduces the power consumption of a popular low power radix-4 pipelined FFT processor architecture proposed in [12] by modifying its operation sequence. The complex multiplier within the butterfly processing unit is one of the most power consuming blocks of the pipelined FFT processor. Coefficient ordering can drastically reduce the switching activity between successive coefficients fed to the complex multiplier, and hence its power consumption. The coefficient ordering requires the data to be sequenced to match the new coefficient order. Data sequencing is performed by a commutator in pipelined FFT processors. Hence, a novel commutator architecture is proposed to handle the new data sequencing for stage 1 of a 16-point FFT processor. The data sequencing for stage 2 is restored by using a dual port RAM (DM) along with a ROM for its address generation. This ordering technique is suitable only for stage 1 of a 16-point radix-4 FFT processor because the data ordering must be restored for the following stage. This in turn requires only a small hardware overhead in the form of a six word additional 
II. ALGORITHM
The N-point DFT of a finite duration sequence x(n) is defined by (2) as follows. 
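For reference, the standard N-point DFT is X(k) = sum over n = 0..N-1 of x(n) * W_N^(nk), with twiddle factor W_N = exp(-j*2*pi/N) and k = 0..N-1. The sketch below evaluates this definition directly. It is an O(N^2) illustration of the transform being computed, not the radix-4 decomposition used by the pipelined architecture, and the function name `dft` is an assumption made for this example.

```python
import cmath


def dft(x):
    """Direct evaluation of X(k) = sum_n x(n) * W_N^(n*k), W_N = exp(-j*2*pi/N)."""
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)  # twiddle factor W_N
    return [
        sum(x[n] * w ** (n * k) for n in range(n_points))
        for k in range(n_points)
    ]


# A unit impulse has a flat spectrum; a constant input has only a DC term.
impulse_spectrum = dft([1, 0, 0, 0])
constant_spectrum = dft([1, 1, 1, 1])
```

The powers of W_N that appear here are the fixed coefficients that a pipelined processor feeds to its complex multipliers.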
III. ORDERED PIPELINED FFT ARCHITECTURE
A pipelined N-point radix-4 FFT processor based on the previously described algorithm, shown in Fig. 3, will have log_4 N stages. Each stage produces one output within each word cycle. Each stage contains a commutator, a butterfly element (for summation) and a complex multiplier. The sequential outputs at each stage must be ordered in accordance with the value of m_t. For instance, from Fig. 2 at stage 1, the outputs associated with m_1 = 0 are produced in the first four word cycles, then those associated with m_1 = 1 in the next four cycles, and so on. As seen in equation (3), the input data for each summation at stage t are separated in time by N_t words. The requisite commutator comprises six shift registers along with three multiplexers and is given in [12].

This paper proposes altered operation sequencing for stage 1 of a 16-point radix-4 pipelined FFT processor based on its signal flowgraph shown in Fig. 4. Normally, the fixed coefficients are fed to the complex multiplier in the order given in Fig. 2, starting from m_1 = 0 and ending with m_1 = 3 for stage 1 of a 16-point FFT processor. Our approach involves ordering the coefficient sequence so as to minimise switching activity between successive coefficients fed to the complex multiplier, as given in Table I and also shown in Fig. 4. The ordering minimises the Hamming distance between successive coefficients. The ordered coefficient set is obtained by first arranging only the imaginary parts of the coefficient set on the basis of Hamming distance. Then, for each coefficient, either the corresponding real part or its two's complement is selected, depending on which has the smaller Hamming distance with respect to the previously arranged real part. A flag bit is asserted to indicate that the real part is in two's complement form. This flag bit is also used to selectively complement the multiplier output [13].
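The ordering idea can be illustrated with a short sketch. Switching activity is counted as the total Hamming distance between successive words, and a greedy nearest-neighbour pass reorders the words to reduce it. The greedy search, the function names and the 4-bit sample words are assumptions made for illustration; the text does not state which search procedure produced Table I. The helper `pick_real` shows the choice between a real part and its two's complement, returning a flag when the complemented form is used.

```python
def hamming(a, b):
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def switching_activity(words):
    """Total bit toggles when the words are applied one after another."""
    return sum(hamming(a, b) for a, b in zip(words, words[1:]))


def greedy_order(words):
    """Nearest-neighbour ordering: always take the closest remaining word."""
    remaining = list(words)
    ordered = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda w: hamming(ordered[-1], w))
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered


def pick_real(prev, real, bits=4):
    """Return (word, flag): real part or its two's complement, whichever toggles fewer bits."""
    neg = (-real) & ((1 << bits) - 1)
    if hamming(prev, neg) < hamming(prev, real):
        return neg, 1  # flag = 1: the multiplier output must be complemented
    return real, 0


# Hypothetical 4-bit words standing in for coefficient parts.
coeffs = [0b0000, 0b1111, 0b0001, 0b1110, 0b0011, 0b1100]
ordered = greedy_order(coeffs)
print(switching_activity(coeffs), switching_activity(ordered))
```

A greedy pass is not guaranteed to find the minimum over all orderings; for a coefficient set as small as that of one 16-point stage, an exhaustive search would also be feasible.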
The switching activity decreases from 192 to just 78, a reduction of 59%, by following this ordering approach. The coefficient ordering requires corresponding data ordering, which is performed by a novel commutator design for stage 1 of a 16-point radix-4 FFT processor. The ordered data sequence at the output of the complex multiplier for stage 1 of the 16-point FFT processor has to be converted back into a normal data sequence for stage 2. This conversion is accomplished by the ADM together with a ROM (ROM0) for its addressing. The new architecture of the 16-point ordered pipelined FFT processor is shown in Fig. 5. The input and output sequences of the ADM, namely DI and DO respectively, are shown in Fig. 6. It is clear that DO is in normal order and can be fed directly to the stage 2 commutator. The stage 2 commutator is the same as that given in [12]. The stage 1 commutator design that supports the ordering scheme is described in the following subsection.
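The role of the ADM and ROM0 can be sketched as a memory that is written in arrival order and read at addresses supplied by a lookup table. The sketch is behavioural only: it assumes that the ROM drives the read addresses and it ignores the overlapped read/write timing that lets the hardware use a small memory. The permutation `perm` is hypothetical and is not the sequence of Fig. 6.

```python
def restore_normal_order(di, read_addr_rom):
    """Write words in arrival order, then read them at the ROM-supplied addresses."""
    ram = list(di)  # write port: sequential addresses
    return [ram[addr] for addr in read_addr_rom]  # read port: addresses from ROM


normal = list(range(8))           # word indices in normal order
perm = [0, 4, 1, 5, 2, 6, 3, 7]   # hypothetical ordered sequence of indices
di = [normal[i] for i in perm]    # DI: data as it leaves the multiplier

# ROM contents: the inverse permutation, i.e. where each normal-order word was stored.
rom0 = [perm.index(i) for i in range(len(perm))]
do = restore_normal_order(di, rom0)
print(do)
```

Since the ordering is fixed at design time, the inverse permutation is a constant table, which is why a ROM suffices for address generation.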
A. Stage 1 Commutator Design for a 16-point FFT Processor
As seen in equation (3) and Fig. 2, the input data for each summation at stage 1 of a 16-point FFT are separated in time by four words. The timing of the ordered data sequence corresponding to the ordered coefficient sequence, together with the normal sequence, is shown as a function of time in Fig. 7, where t' is the instant when the first input word arrives. Each input word occupies a word slot of duration T and is numbered according to its appearance in time. This ordered data sequence can be generated with the help of a commutator. It is difficult to generate the ordered data sequence with a conventional FIFO based on shift registers (SRs) or DMs as given in [12]. In order to achieve flexibility, the commutator is constructed from three double size (eight word) triple port RAM (TM) based FIFOs rather than six four word DM based FIFOs. The additional read port in the TM greatly helps in generating the ordered sequence. The commutator comprises (cs). TM2 is selectively disabled for writing by a logic high on its chip select input. This is done to avoid unnecessary writing of TM2, thereby reducing power consumption. It is clear from the highlighted nibbles of Table II that the lower three bits (addra) of these nibbles remain fixed in two blocks. This means that 'addra' remains fixed for all these locations, and hence there is no switching activity on port A for almost half of the time duration. This addressing approach reduces the switching activity on the unused ports and hence the power consumption. This sort of addressing is also employed for ports C, E and F.
IV. RESULTS
The conventional and ordered pipelined FFT processor architectures have been implemented in a register transfer level hardware description language and then synthesized using a 0.18 µm CMOS technology library. Power evaluation was then carried out on the circuit netlist using a supply voltage of 1.8 V and a clock frequency of 100 MHz. The switching activity decreases from 192 to 78, a reduction of 59%, as per Table I. The comparative results in terms of power for different FFT lengths, FIFO implementation styles and two common low power multiplier types [14] are given in Table III. The FIFO was implemented in three different ways, namely based on SR, DM and our TM. The DM-SR approach uses a DM based FIFO for the stage 1 commutator and an SR based FIFO for the stage 2 commutator in the case of a 64-point processor. The design of a 16-point FFT processor has been carried out for two different multiplier types, namely carry save array (csa) and non-Booth coded Wallace tree (nbw). It is clear from Table IV that our ordered architecture gives power savings for the two multiplier types and different FIFO architectures for the 16-point and 64-point FFT processors. The percentage power saving of our ordered approach is smaller for the nbw multiplier type in most cases, but the nbw multiplier based architecture consumes less power than the one based on csa. Moreover, DM based approach is better for stage 1 of a Our ordered approach gives power savings of the order of 23% and 29% with respect to SR and DM respectively for the 16-point FFT processor using the csa multiplier. The power saving of our ordered approach is 28%, 13% and 9% with respect to SR, DM and DM-SR respectively for the 64-point FFT processor using the csa multiplier. The percentage power saving with respect to the overall power consumption will go down further for longer FFTs because the ordering approach is restricted to stage 1 of a 16-point FFT or stage 2 of a 64-point FFT processor. 
This restriction is imposed in view of the large ADM requirement and the commutator design complexities for the initial stages of longer FFTs. Such a large ADM would more than offset any power saving due to ordering in the complex multipliers. Table IV lists the power consumed by the major cells of the pipelined FFT processor for different FIFO implementations. It is clear from Table IV that the SR based FIFO architecture is inferior to the other architectures for the 64-point FFT due to the high power consumption in the large FIFO blocks of the stage 1 commutator. This high power consumption is attributed to the shifting (switching) of all data samples on every clock cycle in the traditional SR based FIFO. It is also evident from Table IV that the power saving in the ordered approach takes place not only in the multiplier but also in the novel stage 2 commutator. The novel stage 2 commutator architecture comprises three double size FIFOs based on TMs rather than the traditional six DM or SR based FIFOs. The new commutator architecture consumes much less power because there is less data movement, and hence less switching activity, among the three FIFOs than among the traditional six FIFOs. The power consumption in the additional read ports of the TM is reduced by holding the outputs of the unused read ports at their previous values, which is achieved by addressing these ports through ROM1. The stage 3 commutator consumes more power in the ordered approach than in the other approaches because its power consumption also includes the power consumed by the ADM. The stage 2 butterfly in the ordered approach consumes more power than in the other approaches, mainly due to the different stage 2 commutator architecture. The stage 3 butterfly in our approach consumes much less power than in the other approaches because it has been designed using XOR gates (controlled inverters) and a summer rather than the traditional programmable adders/subtractors [12]. 
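The combination of XOR gates used as controlled inverters with a single summer can be illustrated generically: inverting every bit of one operand and adding a carry-in of 1 forms its two's complement, so one adder serves for both addition and subtraction. The sketch below is a behavioural model of that general technique, not the authors' butterfly circuit. The function name and the 16-bit word length are assumptions.

```python
def add_sub(a, b, subtract, bits=16):
    """Compute a + b or a - b (mod 2**bits) using one adder and XOR-controlled inversion."""
    mask = (1 << bits) - 1
    control = mask if subtract else 0
    b_eff = (b ^ control) & mask           # XOR gates invert b only when subtracting
    carry_in = 1 if subtract else 0        # the +1 completes the two's complement of b
    return (a + b_eff + carry_in) & mask


print(add_sub(5, 3, subtract=False), add_sub(5, 3, subtract=True))
```

Negative results appear in two's complement form, so `add_sub(3, 5, True)` yields the 16-bit pattern for -2.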
The stage 1 and stage 2 butterflies in our ordered approach consume more power than the other approaches due to the different input/output conditions in the form of a new stage 2 commutator architecture based on larger size TMs. The stage 1 multiplier consumes less power in our 
V. CONCLUSION
This paper has presented a novel order based pipelined FFT processor architecture suitable for shorter FFTs. However, this design approach can also be applied to the last stages of longer FFTs. The switching activity reduction obtained by ordering is around 59%. The corresponding power saving varies from 23% to 14% for a 16-point FFT and from 9% to 4% for a 64-point FFT using commonly used low power multipliers. This low power design is promising for wireless LAN applications requiring short FFTs. It can also be used for longer FFTs in other OFDM applications such as digital audio and video broadcasting.
