Abstract-This paper presents a low-power design of a two-stream MIMO FFT/IFFT processor for WiMAX applications. A novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce power consumption and hardware cost. With these schemes, half of the memory accesses and 64 Kbit of memory can be saved. Furthermore, by proper scheduling of the two data streams, the proposed design achieves better hardware utilization and can process two 2048-point FFTs/IFFTs consecutively within 2052 cycles. A test chip of the proposed FFT/IFFT processor has been designed using UMC 
I. INTRODUCTION
Multiple-input multiple-output orthogonal frequency division multiplexing (MIMO OFDM) is considered a key technology for high-throughput transmission over wireless fading channels. The emerging WiMAX/IEEE 802.16 standard employs this technology in its physical-layer specification to provide broadband wireless access services. The specification supports scalable channel bandwidths from 1.25 to 20 MHz by adjusting the FFT size (from 128 to 2048 points) for different applications. Three modulation types (QPSK, 16-QAM, and 64-QAM) and four guard-interval modes (1/4, 1/8, 1/16, 1/32) are also supported to further increase system scalability. A block diagram of a 2x2 MIMO transceiver for WiMAX applications is shown in Fig. 1. By processing two data streams with duplicated antennas and functional units, the peak data rate of the 2x2 MIMO transceiver can be doubled compared to that of a single-input single-output (SISO) transceiver.
To support a MIMO transceiver for WiMAX applications, a variable-length FFT/IFFT processor capable of processing multiple data streams is required. Since 2x2 MIMO with time division duplex (TDD) mode is defined in the WiMAX Forum Release-I system profiles [1], a two-stream 128/256/512/1024/2048-point FFT/IFFT processor is considered in this paper. Moreover, because power consumption is critical for portable systems, the FFT/IFFT processor for WiMAX applications should be power-efficient. Many low-power FFT designs employ the cached-memory architecture to reduce memory accesses [2], [3]. However, the increase in wordlength [2] or idle cycles [3] 
II. ALGORITHM
The N-point discrete Fourier transform (DFT) of a complex input sequence x(n) can be defined as:

$$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1,$$

where $W_N = e^{-j 2\pi / N}$ is the twiddle factor.

[2], [3]. The modules of the proposed design are described in more detail below.

A. Main Memory

For memory-based FFT processors supporting consecutive I/O, multiple main memories are needed as computation and I/O buffers [7]. To reduce the total memory size, the continuous flow (CF) memory architecture was proposed in [7], in which only two N-word memories are required for an N-point FFT. Although the CF FFT reduces memory size by performing I/O operations concurrently in a single memory, it requires additional control for memory addressing and for the butterfly units (BU). This is because the original CF FFT adopts radix-4 and radix-2 algorithms, which have different bit-reversed orders. In our proposed design, however, the CF memory architecture causes no problem, since the radix-2^3 and radix-2^2 algorithms have the same bit-reversed order as the radix-2 algorithm [4]. As shown in Fig. 4, one 4096-word SRAM works as the I/O buffer while the other works as the processing buffer, and vice versa. Each SRAM is further partitioned into eight banks to support eight simultaneous accesses for the radix-2^3 algorithm.
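The bit-reversed ordering mentioned above can be illustrated with a small Python sketch. It covers only the plain radix-2 decimation-in-frequency case, which the text states shares its output order with the radix-2^3 and radix-2^2 algorithms; it is not the addressing logic of the proposed hardware, and the function names are chosen here for illustration.

```python
import cmath

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def dif_fft_radix2(x):
    """Radix-2 DIF FFT computed in place on a copy of x.

    The result is left in bit-reversed order: position p holds X(bit_reverse(p)).
    The length of x must be a power of two.
    """
    data = list(x)
    n = len(data)
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                # Twiddle factor W_M^k for the current sub-transform size M = 2 * span.
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                a = data[start + k]
                b = data[start + k + span]
                data[start + k] = a + b
                data[start + k + span] = (a - b) * w
        span //= 2
    return data

def naive_dft(x):
    """Direct evaluation of the DFT definition, used only as a reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]
```

Reading the in-place result at address `bit_reverse(k, bits)` returns X(k), so a single address mapping is enough to write results back in natural order when every supported radix shares that ordering.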
B. Ping-Pong Cache-Memory Architecture

The cached-memory FFT [2], [3] was proposed to lower power consumption by reducing memory accesses. As shown in Fig. 5, data are first read from the main memory and then sent to the cache. With proper data scheduling, the PE can perform multiple-stage processing by accessing the local cache instead of the main memory. Although the cached-memory FFT reduces memory accesses effectively, it requires a concurrent read/write cache with complex control to increase the throughput. We therefore propose the ping-pong cache-memory architecture, which uses a simple cache with single read/write operations. As illustrated in Fig. 6, data read from the main memory are first processed by the PE and then written to the cache for later use. After the cache is full, the data in the cache are read by the PE and the computed results are stored back to the main memory. Since the radix-2^3 algorithm is adopted in the proposed design, a 64-word cache is employed to support two-stage radix-2^3 processing. With this scheme, half of the memory accesses can be saved. Moreover, the ping-pong cache-memory has a shorter latency than the cached-memory, which is beneficial when scheduling data streams.

C. Processing Engine (PE)

The PE is designed to perform radix-2^3/2^2/2 butterfly operations and complex multiplications with the proposed block scaling approach, as shown in Fig. 7. Since variable-length FFTs must be supported and the final stage can be radix-2^3, radix-2^2, or radix-2 as described earlier, a configurable radix-2^3/2^2/2 butterfly unit capable of processing one radix-2^3, two radix-2^2, or four radix-2 butterflies is adopted. We use the 2048-point FFT mode to describe the control of the PE. At the first processing stage, since the inputs have the same decimal point, data alignment is skipped. Input data are processed by the radix-2^3 BU directly and then passed to the first overflow detection and scaling unit (ODSU1) in Fig. 7. If an overflow is detected, all eight inputs are scaled and the corresponding shift in exponent is sent to the block scaling unit. Afterward, the output of ODSU1 is sent to the complex multipliers for twiddle factor multiplication. The outputs of the complex multipliers are passed to the second overflow detection and scaling unit (ODSU2) in Fig. 7, where the same operation as in ODSU1 is performed. The second and third stages have control flows similar to that of stage 1. For stage 4, after the inputs are aligned in decimal point for processing, two radix-2^2 operations are performed. At this stage, however, only scaling is performed in ODSU1, since the final output has a fixed exponent in our proposed block scaling algorithm. Complex multiplications and ODSU2 are also skipped in this stage because no twiddle factor multiplication is required at the final stage, as shown previously in Fig. 2. The detailed control flow for all 128- to 2048-point FFT modes is summarized in Table I.

[9] uses a shorter wordlength of 12 bits since it supports only 9-bit input. The processor in [8] employs the BFP approach, and thus its wordlength is not increased. However, neither design [8], [9] employs a cache to reduce the power of memory accesses.
This comparison shows that the proposed design achieves satisfactory results in both normalized area and FFTs per energy, which supports the feasibility of the proposed method.
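The memory-access saving attributed to the cache can be checked with a short counting sketch in Python. It assumes a simplified model that is not taken from the paper: every pass over the main memory reads and writes each of the N words once, and log2(N) is split into radix-2^3 stages followed by at most one radix-2^2 or radix-2 final stage. The function names are hypothetical.

```python
def stage_radices(n_points):
    """Split log2(N) into radix-8 stages plus an optional radix-4 or radix-2 final stage."""
    bits = n_points.bit_length() - 1
    if n_points < 2 or n_points != 1 << bits:
        raise ValueError("n_points must be a power of two")
    stages = [8] * (bits // 3)
    remainder = bits % 3
    if remainder:
        stages.append(1 << remainder)
    return stages

def main_memory_accesses(n_points, stages_per_pass):
    """Count main-memory word accesses under the simplified model.

    Each pass costs one read and one write per word; `stages_per_pass`
    is how many FFT stages are completed before results return to main memory.
    """
    n_stages = len(stage_radices(n_points))
    passes = -(-n_stages // stages_per_pass)  # ceiling division
    return passes * 2 * n_points

# 2048-point mode: four stages, matching the stage 1 to stage 4 description.
no_cache = main_memory_accesses(2048, 1)
with_cache = main_memory_accesses(2048, 2)
print(stage_radices(2048), no_cache, with_cache)
```

Under this model the 2048-point mode needs 16384 word accesses without a cache and 8192 when two stages are completed per pass, which is the halving stated in the text. For modes with an odd number of stages the same model gives a smaller saving.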
