Abstract: This work aims to provide a low-power, high-performance architecture for the MC-CDMA receiver. We present a dynamically reconfigurable FFT architecture that switches between 64 points and 16 points based on a channel parameter, i.e., the delay spread. A conventional FFT would instead be designed for the worst-case FFT length, which is our potential source of power reduction. Clock gating combined with FFT point reconfigurability is used to switch from the shorter FFT length to the longer one or vice versa. This work also explores different ways of reducing power in the major internal blocks of the FFT based on data dependency. The simulation results are compared with the existing system in terms of power dissipation, area, and performance. Results show an overall power reduction of more than 50% and a performance improvement of about 24%, with a slight area increase of 4.13%. The proposed architecture is modeled in Verilog HDL, simulated using NCLaunch from Cadence, and synthesized with TSMC 180 nm and 45 nm technologies.
Introduction
The Fast Fourier Transform (FFT) is one of the fundamental operations in DSP, telecommunications, speech and image processing, and related fields. Recently, the FFT has been used as one of the key components in OFDM-based wideband communication systems. It is desirable for wireless receivers to adapt their operation rather than be targeted only at a fixed or worst-case scenario. These receivers have to be flexible enough to accommodate various operating conditions while meeting computational requirements and simultaneously achieving low power consumption. The combined requirement of high performance and flexibility with low power consumption is the focus of the architecture implemented in this paper. The commutator and the complex multiplier account for a dominant part of the total power consumption, and their share grows as the FFT size increases. The complex multiplier has high switching activity between the successive coefficients fed to it, and its power consumption can be reduced by coefficient ordering.
Previous works
In [1], the authors proposed a low-power FFT architecture based on multirate signal processing and asynchronous circuit technology. The communication is localized, and the sharing of the global memory is eliminated. To reduce the number of operations in the FFT and thus reduce power consumption, some researchers use shifters and adders to replace the complex multiplications by certain constant coefficients. The authors of [2] employ seven shift-and-add units to carry out seven multiplications in parallel, each by a constant coefficient. Papers [2, 3] proposed a multiplierless architecture based on common sub expression sharing, which replaces the complex multipliers in FFTs, and a low-power commutator architecture, which reduces the number of memory accesses. In addition to the pipelined FFT, the parallel-pipelined FFT is a good solution for applications requiring high throughput and high power efficiency. This paper aims to develop an FFT processor with reconfigurability, low power, and high performance. The approach is to implement basic parts that can be combined to form a reconfigurable architecture, such that these parts are reusable and the architecture is adjustable in several parameters. In our case, for an FFT processor, one such parameter is the FFT size [5].
Proposed reconfigurable system and its architecture
A reconfigurable 64-point R4SDC pipelined FFT processor architecture comprises three radix-4 stages, as shown in Figure 1. Reconfigurability is achieved by inserting a multiplexer, MUX I, which selectively enables stage-64 and stage-16 and routes the input data directly according to the required FFT size. The FFT processor can act as a 16-point processor by feeding the input data directly into stage-16 and clocking down stage-64. This is accomplished by using the external select input S16 of MUX I to select the input data rather than the output of stage-64. Moreover, the gated clock [...] Then the corresponding multiplier and adder blocks in the FFT can be disabled so as to reduce the computation; similarly, if the first data item is followed by '0000001', the multiplier can be disabled and the multiplicand value passed on to the next stage. These data dependencies of the input data are exploited to reduce power in the computational elements of the internal blocks of the FFT architecture.
Algorithm for dynamic reconfiguration of FFT points
The dynamic reconfiguration of the FFT size follows this algorithm:
1. Get the variation in the channel parameters such as delay spread (D) and signal-to-noise ratio (SNR).
2. Let a1 and a2 be the threshold values of SNR and b1 and b2 be the threshold values of delay spread (D).
6. Repeat steps 4 and 5 for different values of the varying channel parameters such as SNR and D.
Canonic Signed Digit (CSD)
The complex multiplication in the FFT is performed using CSD and the sub expression sharing technique. The advantage of the CSD form is that no N-bit value has more than (N+1)/2 non-zero digits. The coefficient constant 5a82 is represented in two's complement format, while 7641 and 30fb are represented in CSD format, where $\bar{1}$ denotes the digit $-1$:

$$
\begin{aligned}
\mathtt{5a82} &: 0101101010000010 \\
\mathtt{7641} &: 1000\bar{1}0\bar{1}001000001 \\
\mathtt{30fb} &: 010\bar{1}000100000\bar{1}0\bar{1}
\end{aligned}
$$

The mixed use of CSD and two's complement minimizes the number of addition/shift operations, as shown in Table I. We can use a shift-and-add implementation of the multiplications by these three constants to carry out the non-trivial complex multiplications. According to the representations above, the multiplications by the three constants are given by

$$
\begin{aligned}
\mathtt{5a82}\,X &= (X \ll 1) + (X \ll 7) + (X \ll 9) + (X \ll 11) + (X \ll 12) + (X \ll 14) \\
\mathtt{7641}\,X &= X + (X \ll 6) - (X \ll 9) - (X \ll 11) + (X \ll 15) \\
\mathtt{30fb}\,X &= -X - (X \ll 2) + (X \ll 8) - (X \ll 12) + (X \ll 14)
\end{aligned}
$$

where X represents the input data [4].
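The shift-and-add expressions above can be checked numerically. The following is a minimal Python sketch, not the paper's Verilog implementation; the helper names `to_csd`, `mul_5a82`, `mul_7641`, and `mul_30fb` are hypothetical and introduced only for illustration. It derives the CSD digits of a constant and confirms that each shift-and-add expression equals an ordinary multiplication by that constant.

```python
def to_csd(value, width=16):
    """Return CSD digits (-1, 0, +1) of a non-negative integer, LSB first.

    Hypothetical helper for illustration; pads with zeros up to `width` digits.
    """
    digits = []
    while value != 0:
        if value & 1:
            # value % 4 == 1 gives +1, value % 4 == 3 gives -1, so that
            # the next digit is always 0 (no two adjacent non-zero digits).
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    digits += [0] * (width - len(digits))
    return digits


def mul_5a82(x):
    # Two's complement form of 0x5a82: six shifted terms, additions only.
    return (x << 1) + (x << 7) + (x << 9) + (x << 11) + (x << 12) + (x << 14)


def mul_7641(x):
    # CSD form of 0x7641: five non-zero digits.
    return x + (x << 6) - (x << 9) - (x << 11) + (x << 15)


def mul_30fb(x):
    # CSD form of 0x30fb: five non-zero digits.
    return -x - (x << 2) + (x << 8) - (x << 12) + (x << 14)


if __name__ == "__main__":
    for x in range(-4, 5):
        assert mul_5a82(x) == 0x5A82 * x
        assert mul_7641(x) == 0x7641 * x
        assert mul_30fb(x) == 0x30FB * x
    # Non-zero digit positions (bit indices) of the two CSD constants.
    print([i for i, d in enumerate(to_csd(0x7641)) if d])
    print([i for i, d in enumerate(to_csd(0x30FB)) if d])
```

Both CSD constants have five non-zero digits, well within the (N+1)/2 bound for N = 16.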
Common sub expression sharing
Common sub expression sharing reuses a sub expression among several multiplication-accumulation operations in order to reduce the total number of operations [4]. This approach is very effective for reducing the hardware cost of multiple constant multiplications, especially for filter-like operations. As discussed in section 2.2, the numbers of operations required for the computation of 5a82X, 7641X, and 30fbX are shown in Table I. Among all nontrivial coefficient multiplications, the proportions of the multiplications by 5a82, 7641, and 30fb are 50%, 25%, and 25%, respectively. Hence, a nontrivial coefficient multiplication requires on average 4.5 additions, 2.5 subtractions, and 7 shifts. The weights Ai are the filter coefficients.
The block diagram of the multiplierless unit is depicted in Fig. 2 (a). Only data that has to be multiplied by non-trivial complex coefficients is fed into the shift-and-add units. Two shift-and-add units are needed for each of the real part (Xr) and the imaginary part (Xi). There are two single-bit control signals, s6 and s7, in the multiplierless unit. Signal s6 indicates whether the input data corresponds to a nontrivial complex coefficient. When signal s7 is asserted (logic 1), the real and imaginary parts of the input data are swapped, and the imaginary part is inverted. Otherwise, the swap unit passes the input data unchanged. In the multiplierless unit, 22 adders replace the four real multipliers of the complex multiplier unit. The data control unit passes the data to the multiplierless unit based on the input sequence. If the subsequent input is zero, it disables the complete unit and outputs the result '0'. If '1' follows, it simply outputs the coefficient as the result, thereby reducing the computational power.
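The data-dependent bypass performed by the data control unit can be illustrated with a short behavioral sketch in Python. This is a simplified model on real integers, not the paper's hardware description; the function name `gated_multiply` and the `stats` counters are hypothetical and serve only to show how often the multiplier would actually be activated.

```python
def gated_multiply(x, coeff, stats):
    """Multiply x by coeff, bypassing the multiplier for the inputs 0 and 1.

    Behavioral sketch only: `stats` counts how many times the multiplier
    would be enabled versus bypassed (disabled) in hardware.
    """
    if x == 0:
        # Zero input: the unit is disabled and the result is 0.
        stats["bypassed"] += 1
        return 0
    if x == 1:
        # Unit input: the coefficient itself is the result.
        stats["bypassed"] += 1
        return coeff
    stats["multiplied"] += 1
    return x * coeff


if __name__ == "__main__":
    stats = {"bypassed": 0, "multiplied": 0}
    samples = [0, 1, 5, 0, -3, 1]   # made-up input sequence
    coeff = 0x5A82
    results = [gated_multiply(x, coeff, stats) for x in samples]
    assert results == [x * coeff for x in samples]
    print(stats)
```

In this made-up sequence, four of the six inputs bypass the multiplier; the actual saving in a receiver depends on the statistics of the input data.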
IDR commutator
The proposed architecture uses the same IDR commutator discussed in Ref. [4], which uses dual-port RAMs as FIFO elements; however, the interconnection topology among the RAM blocks differs from that of the conventional approach. The RAM blocks (DM0, DM1, ..., DM5) are enabled and disabled as appropriate so as to save power. Table II illustrates which RAM blocks are enabled for write access during each period. It can be seen that at most three RAM blocks are selected in a given period. For stage t, when Mt is equal to 1, new Nt-1 input data is processed. The first Nt data will be written into DM0. The previous Nt data stored in DM0 will be read out and written into DM2 to vacate space for the new data. The same applies to DM2 and DM4. The other three RAM blocks (DM1, DM3, and DM5) are disabled for write access during this period. For Mt = 0 and 3, the number of RAM blocks enabled is two, because the previous data stored in DM2 and DM3 are no longer needed for subsequent outputs. Therefore, during the four periods, each RAM is enabled 5/3 times on average, whereas for the dual-port RAM (DR) and triple-port RAM (TR) architectures the corresponding figures are 4 and 10/3 times, respectively. Hence, the commutator architecture is significantly more power efficient than the other commutator architectures.
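The 5/3 figure can be reproduced with a few lines of arithmetic. The sketch below is only a sanity check in Python: the text gives the number of write-enabled RAM blocks for Mt = 0, 1, and 3, while the value of three for Mt = 2 is an assumption (Table II is not reproduced here), chosen to be consistent with the stated maximum of three and the stated average.

```python
from fractions import Fraction

NUM_RAM_BLOCKS = 6  # DM0 .. DM5

# Write-enabled RAM blocks in each period Mt = 0..3.
# Mt = 0, 1, 3 follow the text; Mt = 2 is an ASSUMED value.
enabled_per_period = {0: 2, 1: 3, 2: 3, 3: 2}

# Average number of write enables per RAM block over the four periods.
idr_avg = Fraction(sum(enabled_per_period.values()), NUM_RAM_BLOCKS)

# Figures quoted in the text for the DR and TR commutators.
dr_avg = Fraction(4)
tr_avg = Fraction(10, 3)

print(idr_avg)            # 5/3
print(idr_avg / dr_avg)   # 5/12
print(idr_avg / tr_avg)   # 1/2
```

Under this assumption, each RAM block in the IDR commutator is write-enabled 5/12 as often as in the DR architecture and half as often as in the TR architecture. Treating enable counts as a proxy for power is itself a simplification.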
Proposed low power butterfly architecture
The architecture shown in Fig. 2 (b) aims to reduce power by removing unneeded sub blocks from the previous architecture [4]. It is based on radix-4 decimation in time. The input data are two's complemented according to the control signals c4, c5, and c6. This architecture uses a Viterbi decoder, which is power efficient, and compressed adders are used to perform the summation. Multiplexers are removed by replacing the select lines, and two's complement blocks are used instead of one's complement blocks.
Simulation and results discussion
The proposed architecture has been modeled in Verilog HDL, and the system-level simulation has been carried out using NCLaunch from Cadence. The design has been synthesized with the 180 nm and 45 nm (slow.lib) technology files from TSMC. The back-end design has been completed using Cadence SOC Encounter with the 180 nm technology file from TSMC. The NanoRoute view of our design is shown in Fig. 3 (a), and the power comparison of the proposed FFT sub blocks with some of the existing work is shown in Fig. 3 (b). The selection of the FFT size depends on the value of the delay spread: for a delay spread between 350 ns and 900 ns, as produced by the de-spreading module in the combiner block, the 64-point FFT is used, whereas for a delay spread above 900 ns the 64-point FFT switches to the 16-point FFT using clock gating to reduce power consumption. Fig. 4 .
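The size selection rule described above can be summarized in a small behavioral sketch. It is written in Python for illustration and is not the Verilog control logic of the design. The function name `select_fft_config`, the returned field names, the polarity of S16 (1 selects the 16-point path), and the inclusive handling of the 350 ns and 900 ns boundaries are all assumptions. The text does not say what happens below 350 ns, so the sketch returns `None` there.

```python
def select_fft_config(delay_spread_ns):
    """Map a delay spread (in ns) to an FFT configuration.

    Thresholds follow the text; field names, S16 polarity, and boundary
    handling are illustrative assumptions.
    """
    if delay_spread_ns > 900:
        # 16-point mode: MUX I feeds the input straight into stage-16,
        # and the clock of stage-64 is gated off.
        return {"fft_points": 16, "s16": 1, "stage64_clock_enabled": False}
    if delay_spread_ns >= 350:
        # 64-point mode: all three radix-4 stages are clocked.
        return {"fft_points": 64, "s16": 0, "stage64_clock_enabled": True}
    # Below 350 ns the behaviour is not specified in the text.
    return None


if __name__ == "__main__":
    for d in (200, 350, 600, 900, 1200):
        print(d, select_fft_config(d))
```

A complete controller would also take the SNR thresholds into account, as the reconfiguration algorithm indicates; they are omitted here because their values are not given.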
Conclusions
In this paper, a new reconfigurable FFT architecture has been designed and made efficient in terms of power and speed by applying data reorder- [...] The comparison is made for the 16-point FFT; in addition, the design has been built, simulated, and synthesized for 64 points with reconfigurability. The simulation results show more than 50% power reduction with a 24% increase in speed, at the cost of a 4% increase in area, for the 16-point reconfigurable FFT architecture.
