In this paper, we present a 64/128/256/512-point inverse fast Fourier transform (IFFT)/FFT processor for single-user and multi-user multiple-input multiple-output orthogonal frequency-division multiplexing based IEEE 802.11ac wireless local area network transceivers. The multi-mode processor is built on an eight-parallel mixed-radix architecture to efficiently provide full reconfigurability for all multi-user combinations. The proposed design not only supports IFFT/FFT operation for 1-8 different data streams belonging to different users in downlink transmission, but also provides different throughput rates to meet IEEE 802.11ac requirements at the minimum possible clock frequency.
Introduction
Recently, the demand for wireless communications has been increasing explosively, especially for Wireless Local Area Networks (WLAN). Although the commercially available IEEE 802.11n-based products have a maximum gross data rate of 540 Mbps [1], this technology cannot provide the throughput required by bandwidth-hungry applications such as large file transfer and uncompressed multimedia streaming over IP [2]. As a consequence of the market pressure, IEEE has offered the fifth generation WiFi networking standard, IEEE 802.11ac, to deliver data rates faster than gigabit Ethernet.
The new generation WLAN uses Orthogonal Frequency Division Multiplexing (OFDM) technology, similar to 802.11n, to cope with severely varying channel conditions and narrowband interference. OFDM is a multi-carrier transmission method that has proven to be reliable and effective. The subcarriers are overlapped to provide high spectral efficiency, and the Discrete Fourier Transform (DFT) and its inverse are utilized to create multiple orthogonal subcarriers [3]. To reduce the effect of Inter-Symbol Interference (ISI), a Guard Interval (GI) or a Cyclic Prefix (CP) is introduced for each OFDM symbol by copying a number of time domain samples from the symbol rear to the symbol head. In OFDM, each subcarrier carries a complex symbol that is usually modulated by the well-known Quadrature Amplitude Modulation (QAM). The advanced Multiple-Input Multiple-Output (MIMO) technique can be further employed to increase the number of spatial parallel streams so that the total throughput is significantly increased.
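The cyclic prefix insertion described above can be illustrated with a minimal Python sketch. The function name `add_cyclic_prefix` and the toy symbol length are assumptions made purely for illustration; they are not taken from the standard or from the proposed design.

```python
def add_cyclic_prefix(symbol, cp_len):
    """Prepend the last cp_len time-domain samples of an OFDM symbol to its head."""
    if not 0 <= cp_len <= len(symbol):
        raise ValueError("cp_len must lie between 0 and the symbol length")
    # Copy the tail (symbol rear) in front of the symbol (symbol head).
    return list(symbol[len(symbol) - cp_len:]) + list(symbol)


# Toy example: 8 time-domain samples with a 2-sample prefix.
samples = [0, 1, 2, 3, 4, 5, 6, 7]
extended = add_cyclic_prefix(samples, 2)
```

The receiver simply discards the first `cp_len` samples before applying the FFT.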
The 802.11ac standard guarantees a Very High Throughput (VHT) by enhancing many features of 802.11n. To support a maximum data rate of 6.933 Gbps [4], the system has to utilize 256-QAM modulation, 160 MHz channel bandwidth, and a maximum of eight parallel transmission streams. To improve the receiver performance, the Low Density Parity Check (LDPC) code is added as an optional Forward Error Correction (FEC) mechanism to the existing Binary Convolutional Code (BCC). In addition, the downlink multi-user MIMO (MU-MIMO) is a new feature to WLAN that has been added by 802.11ac [4]. Here, multiple independent streams can be transmitted to multiple users at the same time over the same frequency band, unlike the single-user MIMO (SU-MIMO) in which multiple independent streams, over multiple independent antennas, can be transmitted to a single user at the same time. In 802.11ac, the MU-MIMO system supports four users with up to four spatial streams per user, with the total number of spatial streams not exceeding eight.
By considering this radical change in wireless network throughput, new challenges arise for the system implementation with respect to computational capacity and power consumption. Among the various building blocks, the DFT processor is one of the most computationally complex modules in the physical layer of the IEEE 802.11ac standard. To the best of the authors' knowledge, current research efforts are not mature enough to provide a complete complexity analysis for the physical layer of IEEE 802.11ac. However, the transform blocks in an IEEE 802.11n transceiver, which is a similar system to 802.11ac, are the second most complex modules of the whole system after the channel decoder [5]. Therefore, special attention has to be paid to these blocks. The most popular algorithm for computing the DFT is the Fast Fourier Transform (FFT) algorithm, which eliminates the redundant calculations needed in computing the DFT and is thus very suitable for efficient hardware implementation. Various FFT architectures, such as the pipelined architecture [6] and the array architecture [7], have been proposed. In fact, there has been extensive research on the hardware implementation of FFT algorithms. However, high-throughput multi-stream applications, and especially 802.11ac, have generally not received much attention. [...] but the maximum throughput is 160 MSamples/sec, which is compliant with 802.11n. Furthermore, 1-8 streams are supported by [9] for only 256-point FFT and 300MHz channel. Since special Digital Signal Processing (DSP) resources are offered by modern Field Programmable Gate Arrays (FPGA), a reconfigurable processor is proposed by [10]. However, this processor is unsuitable for high throughput and multiple stream applications.
According to the IEEE 802.11ac standard, the minimum OFDM symbol period is 3.6 µsec, where the guard interval is 400 nsec. Therefore, the maximum required throughput for the FFT processor at the receiver side is 1.138 GSamples/sec, where 512 points with 8 simultaneous data sequences are processed. To support this requirement, a software defined architecture is presented in [11]. Although that is the first recorded real-time configurable architecture for the 802.11ac FFT and it effectively reduces the resources, the minimum supported clock frequency is 330 MHz, which is a source of high power consumption. Also, this work does not address the parallel stream processing provided by different users at the downlink transmitter. Moreover, to our knowledge, all the high-throughput multi-stream architectures including [11] assume a fixed bus width for the processor. Indeed, the trade-off between system performance and hardware complexity has to be addressed accurately to develop an optimum design.
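The worst-case throughput figure quoted above follows directly from the symbol period, the FFT size, and the number of streams. A small Python sketch of that arithmetic is given below; the function name `fft_throughput_msps` is a hypothetical helper introduced only for this illustration.

```python
def fft_throughput_msps(n_points, n_streams, symbol_period_us):
    """Samples per microsecond (i.e. MSamples/sec) that the FFT must sustain."""
    return n_points * n_streams / symbol_period_us


# Worst case: 512-point FFT, 8 simultaneous streams, 3.6 usec symbol period.
worst_case = fft_throughput_msps(n_points=512, n_streams=8, symbol_period_us=3.6)
print(f"{worst_case:.1f} MSamples/sec")  # about 1137.8, i.e. roughly 1.138 GSamples/sec
```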
In this paper, we present a configurable VHT multi-stream IFFT/FFT processor for both downlink and uplink transceivers. The problem of a unified bus width across the design has been tackled by applying a simulation-based optimization procedure that relies on practical Signal to Quantization Noise Ratio (SQNR) values, which have been carefully designed to achieve the proper system performance. The algorithm takes into account the input statistics and all quantization error sources to provide the best combination of wordlengths across the processor. In IFFT mode, MU-MIMO is supported by scheduling different stream units to different users based on the operating mode. Proper resource sharing between different streams is employed to reduce the overall gate count.
The power has been minimized by reducing the operating clock frequency to the sampling frequency. Thus, the processor is designed in a fully pipelined architecture whose issue rate is exactly one. This guarantees that the proposed dedicated circuit-based architecture operates at the minimum acceptable clock frequency. Moreover, the unused modules are switched off by the control to further reduce the required power. We go further and consider the challenges that arise when an FPGA is the target platform. The paper is organized as follows. Section 2 summarizes the IEEE 802.11ac system model and provides the background for the mixed radix IFFT/FFT algorithm. In Section 3, the proper architecture is selected to support the required [...]
Mixed Radix FFT Algorithm
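Given a complex input sequence x(n), an N-point DFT is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad (1)$$

and the N-point inverse DFT can be written as

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1,$$

where $W_N = e^{-j 2\pi / N}$ is the twiddle factor. [...]

The reason is that more arithmetic operations have to be involved for each radix-r stage, wider bit-widths are typically required for internal variables, and the control path needs to handle more ports for each radix-r stage.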
The mixed-radix algorithm is the best in terms of multiplicative complexity for an N-point FFT when the number of DFT points is configurable [14]. This algorithm decomposes a length-N FFT into multiple radix stages to effectively reduce the number of complex multiplications. In general, if N is decomposed such that $N = r_1^{\alpha} \times r_2^{\beta}$, $r_1 \neq r_2$, and α and β are two non-zero integers, then a mixed radix decomposition is achieved. In this case, the DFT definition can be rewritten as given by (2), where we substitute for n = r [...] be decomposed into β stages of radix-$r_2$. Again, the objective is to minimize the complexity and, at the same time, to support reconfigurability by allowing different numbers of α and β.
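To make the stage decomposition concrete, the following Python sketch implements a generic decimation-in-time mixed-radix FFT and checks it against a direct DFT. It is an illustration under stated assumptions, not the processor's datapath: the helper `stage_plan`, the radix-4/radix-2 split it returns, and the ordering of the stages (radix-2 first) are all hypothetical choices made for this example.

```python
import cmath
import math


def direct_dft(x):
    """Reference O(N^2) DFT: X(k) = sum_n x(n) * exp(-j*2*pi*n*k/N)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def mixed_radix_fft(x, radices):
    """Decimation-in-time FFT; radices gives the radix of each stage in order."""
    n = len(x)
    if n == 1:
        return list(x)
    r = radices[0]
    if n % r != 0:
        raise ValueError("stage radix must divide the current length")
    m = n // r
    # Split into r interleaved sub-sequences; the remaining stages transform each one.
    subs = [mixed_radix_fft(x[q::r], radices[1:]) for q in range(r)]
    # Radix-r butterfly: combine the sub-transforms using twiddle factors W_N^(q*k).
    return [sum(cmath.exp(-2j * cmath.pi * q * k / n) * subs[q][k % m]
                for q in range(r))
            for k in range(n)]


def stage_plan(n):
    """Hypothetical plan: one radix-2 stage if log2(n) is odd, radix-4 stages otherwise."""
    stages = n.bit_length() - 1  # log2(n) for a power of two
    return [2] * (stages % 2) + [4] * (stages // 2)


# Example: 128 = 2 x 4^3, so one radix-2 stage and three radix-4 stages.
x = [complex(math.cos(i), math.sin(3 * i)) for i in range(128)]
max_error = max(abs(a - b)
                for a, b in zip(direct_dft(x), mixed_radix_fft(x, stage_plan(128))))
```

With this plan, 64 and 256 points use radix-4 stages only, while 128 and 512 points need one additional radix-2 stage.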
Processor Architecture
We propose a configurable 64/128/256/512-point IFFT/FFT processor to support 1-8 simultaneous data sequences for a MIMO OFDM system that is compliant with the 802.11ac standard. Furthermore, the processor can process [...]

There are recent studies that optimize the FFT architecture for simultaneous stream processing, such as [8] and [9]. The basic assumption is that different streams are processed concurrently and there is no possibility that different streams are not time-aligned. In fact, this approach cannot satisfy the requirements for MU-MIMO at the transmitter side, in which separate user streams have to be processed independently. To support this feature, the processor is structured such that parallel independent units are employed to process the parallel independent streams. There are two degrees of freedom for the processor reconfigurability that enable the processor to function in either SU-MIMO or MU-MIMO mode. The number of FFT points is configurable for each stream, and the number of active parallel streams can also be controlled.
Furthermore, one recent design for the FFT engine of IEEE 802.11ac has been presented in [15]. Although this design is an Application Specific Integrated Circuit (ASIC) design, it is beneficial to discuss the architecture and to study its suitability for the FPGA environment. This architecture utilizes eight parallel [...] The operating clock frequency is designed to be only 40 MHz, which implies that higher sampling rates are not supported by such a processor. For the FPGA environment, storing eight complete symbols in registers is mainly a utilization issue due to the limited number of logic elements for sequential processing.
Mixed Radix Selection
The mixed radix architecture is employed to support this configurable IFFT/FFT unit for each stream. In fact, the most challenging task is to properly choose the mixed radix system, including the number of required radices, the values of those radices, the number of stages for each radix, and the order of the different stages. The selection criterion aims at deploying a low complexity design that is fully configurable and optimized under the umbrella of [...] the architecture should be able to support all different configurations. (2) A higher radix is preferred to minimize the number of required twiddle multipliers. (3) The quantization noise sources should be kept to a minimum. [...] constant multiplication is involved, the system performance is influenced by the quantization noise induced by the multiplier output quantization. Based on this investigation, the conclusion is that the best practical mixed radix system for the 802.11ac multi-mode IFFT/FFT processor is the radix-4/radix-2 system. It not only fits the required specifications, but also offers the best compromise between complexity, performance, and simple implementation.

[...] FFT designs [19], and hence an optimization technique is also presented as a result of this analysis. However, the procedure assumes an overestimated quantization error to obtain a safe but not optimum fixed-point model. Moreover, traditional FFT fixed-point studies do not particularly elaborate on the basis by which the ADC is quantized, as it is mainly an application-specific problem.
All these approaches, and others such as [20] and [12], assume that the input is uniformly distributed to simplify the analysis. On the contrary, multi-path signals are received at the OFDM receiver, and hence the input distribution is usually considered to be Gaussian. This assumption has been investigated in detail in [21], where the SQNR is analyzed for different received signal powers and different wordlengths. However, this work aims only at introducing [...]

The ratio, R_QN, between the quantization noise power and the traditional noise plus interference power. This ratio considers the amount by which the traditional noise plus interference dominates over the quantization noise power, so that the detection procedure is not sensitive to the quantization noise. Enlarging this parameter will enhance the performance but will also result in higher system complexity, as more bits will be involved. The SQNR at the block output due to the input quantization can be obtained for the selected input bit-width. The input has to be generated with the same distribution and the same power as in the integrated model. This SQNR will be the target design metric, or the reference SQNR, for the block because [...]

    for i = 1 : length(N_Bits) do
        [...]
        Simulate to obtain SQNR(i)
    end for
    Find the smallest k where SQNR(k) ≈ SQNR_REF
    Freeze N_b = N_Bits(k) for the current variable
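The surviving steps of the wordlength search can be sketched in Python as follows. This is a toy stand-in rather than the paper's simulation: instead of a fixed-point FFT model, the `simulate` callable here only quantizes zero-mean Gaussian samples with a uniform quantizer, and the tolerance `tol_db` used to interpret "≈", the full-scale convention, and all function names are assumptions made for illustration.

```python
import math
import random


def quantize(value, n_bits, full_scale=1.0):
    """Uniform n_bits quantizer over [-full_scale, full_scale) with saturation."""
    step = 2.0 * full_scale / (2 ** n_bits)
    q = round(value / step) * step
    return min(max(q, -full_scale), full_scale - step)


def sqnr_db(samples, n_bits):
    """Signal-to-quantization-noise ratio in dB for the given wordlength."""
    signal = sum(v * v for v in samples)
    noise = sum((v - quantize(v, n_bits)) ** 2 for v in samples)
    return 10.0 * math.log10(signal / noise)


def smallest_wordlength(candidates, simulate, sqnr_ref_db, tol_db=0.5):
    """Sweep the candidate wordlengths and return the smallest one whose
    simulated SQNR reaches the reference within tol_db (None if none does)."""
    for n_bits in sorted(candidates):
        if simulate(n_bits) >= sqnr_ref_db - tol_db:
            return n_bits
    return None


random.seed(0)
sigma = 10 ** (-14 / 20)  # assumed convention: 14 dB back-off from a full scale of 1.0
samples = [random.gauss(0.0, sigma) for _ in range(20000)]
chosen = smallest_wordlength(range(7, 15), lambda b: sqnr_db(samples, b), sqnr_ref_db=57.0)
```

In a full flow, `simulate` would run the fixed-point block with the current variable set to the candidate wordlength, and the search would be repeated for each internal variable in turn once the previous one is frozen.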
FFT Processor Implementation

In this section, we present the hardware implementation aspects of the proposed architecture. First, when enabled, the Radix-2 stage applies simple addition and subtraction to two successive inputs. The outputs are then serialized to be fed to the reorder block. Second, the traditional twiddle multiplier uses four real multipliers, one addition, and one subtraction. To reduce the complexity, the multiplier architecture presented in [20] has been utilized, where only three multipliers, three adders, and two subtraction units are involved. The remaining building blocks are discussed in detail below.
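One well-known factorization that matches this operation count (three multiplications, three additions, two subtractions) is sketched below in Python. Whether [20] uses exactly this arrangement is not confirmed here; the function name `complex_mul_3` is hypothetical and the sketch only illustrates the principle.

```python
def complex_mul_3(a, b, c, d):
    """Return (real, imag) of (a + jb) * (c + jd) using three real multiplications."""
    k1 = c * (a + b)  # multiplier 1, adder 1
    k2 = a * (d - c)  # multiplier 2, subtractor 1
    k3 = b * (c + d)  # multiplier 3, adder 2
    real = k1 - k3    # subtractor 2: ac - bd
    imag = k1 + k2    # adder 3: ad + bc
    return real, imag


# (1 + 2j) * (3 + 4j) = -5 + 10j
product = complex_mul_3(1, 2, 3, 4)
```

When (c + jd) is a stored twiddle factor, the terms (d - c) and (c + d) depend only on the constant, so they could in principle be precomputed; that variant is not shown.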
Radix-4 Module
To achieve the full functionality of the reconfigurable architecture, Radix- 
Reordering Block
The reorder block architecture is shown in Fig. 5. To allow an issue rate of one, two ping-pong memory modules are used to accept a continuous stream of input samples such that the processing of two FFT windows is overlapped. The idea is that the first FFT window is placed in the first module and the following FFT window occupies the second memory module. While the samples are written linearly into the second module, the first module is read using the interleave (shuffling) address. When the second module is filled, the next FFT window is written to the first memory module and the second memory is read using the interleave address. This mechanism keeps the input and output sequences continuous, so that one sample is processed every clock cycle and the reorder process never stalls. The internal control is responsible for responding to new valid inputs by updating the linear address, the interleave address, the multiplexer controls, and the memory module enables.
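The ping-pong mechanism can be modelled behaviourally with the short Python sketch below. It is a sketch under assumptions: the class name `PingPongReorder` is hypothetical, and a bit-reversed read order is used only as a placeholder for the interleave address sequence, which is not specified in this section.

```python
def bit_reverse(index, n_bits):
    """Reverse the n_bits-wide binary representation of index."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


class PingPongReorder:
    """Two memory banks: one is written linearly while the other is read shuffled."""

    def __init__(self, window, read_addr):
        self.window = window
        self.read_addr = read_addr  # placeholder interleave address sequence
        self.banks = [[None] * window, [None] * window]
        self.write_bank = 0
        self.count = 0
        self.primed = False  # becomes True once the first window is stored

    def clock(self, sample):
        """Accept one input sample and return one output sample (None while filling)."""
        out = None
        if self.primed:
            out = self.banks[1 - self.write_bank][self.read_addr[self.count]]
        self.banks[self.write_bank][self.count] = sample
        self.count += 1
        if self.count == self.window:
            # Window complete: swap the roles of the two banks.
            self.count = 0
            self.write_bank ^= 1
            self.primed = True
        return out


reorder = PingPongReorder(8, [bit_reverse(i, 3) for i in range(8)])
outputs = [reorder.clock(s) for s in range(24)]
```

After the initial latency of one window, the model produces one reordered sample for every input sample, which mirrors the issue rate of one described above.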
Twiddle Factor Generator
Twiddle factors are generated by the control units that manage different [...] The twiddle factors can be divided into four groups, where each group corresponds to one quarter of the full complex exponential waveform. The first group is of the trivial type, which is handled by skipping the twiddle factor multiplication. The second group is generated by directly accessing the tables. Since the complex exponential of the third group has simply double the frequency of the complex exponential of the second group, the third group samples can be viewed as a decimated version of the second group samples with a decimation ratio of two. Therefore, the ROM address is doubled to accept a sample and drop a sample. The same concept is applied to the last group, in which the decimation ratio is three. In all cases, proper sign and address adjustments are required to maintain the expected functionality. When the radix-2 stage is inactive, half of the normal twiddles are required. The decimation concept is applied by taking the up-counter value as twice the value that would have been generated in the normal mode.
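The address-scaling idea can be illustrated with the following Python sketch. It makes simplifying assumptions that are not taken from the paper: a single full-circle table of 512 entries is used (`ROM`, `N_MAX`), so no sign or quadrant adjustment is needed, and smaller FFT sizes are served by an additional address stride. The function name `twiddle` is hypothetical.

```python
import cmath

N_MAX = 512  # largest FFT size; a single full-circle table is an assumption
ROM = [cmath.exp(-2j * cmath.pi * k / N_MAX) for k in range(N_MAX)]


def twiddle(n, group, k):
    """Return W_n^(group * k) by scaling the ROM address instead of storing more tables.

    group 0 is the trivial case (always 1, so the multiplication can be skipped),
    group 1 reads the table directly, group 2 doubles the address, group 3 triples it.
    """
    stride = N_MAX // n  # smaller FFT sizes read a decimated version of the table
    address = (group * k * stride) % N_MAX
    return ROM[address]


# Example: third twiddle group (address doubled) for a 64-point transform.
w = twiddle(64, 2, 5)
reference = cmath.exp(-2j * cmath.pi * 2 * 5 / 64)
```

A hardware table would typically store only part of the circle and recover the rest through the sign and address adjustments mentioned in the text; that refinement is omitted here.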
Switching Network
The switching network is the block that distributes the control signals com- 
Results
In this section, we apply the fixed point optimization technique to the 802.11ac IFFT and FFT modules to obtain the optimum wordlengths for all IFFT and FFT variables. The final fixed point model for the IFFT/FFT processor is implemented and verified on FPGA. The implementation shows that all 802.11ac timing constraints are met by the presented architecture.
Fixed Point Simulations
To obtain the optimum wordlengths for the processor, a simulation environment that is compliant with 802.11ac specifications is coded in MATLAB, where $10^4$ OFDM data symbols are generated for each SNR value. The worst case scenario has been configured such that a 160 MHz channel is used (i.e. 512 subcarriers). Among those subcarriers, there are 468 data subcarriers that carry [...]
The floating point model is adapted and verified against the theoretical system performance in an Additive White Gaussian Noise (AWGN) environment. Since the uncoded performance for 256-QAM at BER = $10^{-5}$ is achieved at SNR ≃ 31.22 dB, the ADC requires at least 11 bits if we assume reasonable and practical parameters such as R_P = 7 dB, R_QN = 20 dB, and SM = 3 dB.

To verify this result, we run a system simulation where the wordlength of the FFT input is swept from 7 to 11 bits and the AGC back-off is assumed to be 14 dB. As shown in Fig. 7, the quantization noise dominates at higher SNR, and more loss is incurred as the bit-width decreases. At BER = $10^{-5}$, the performance loss is about 0.18 dB and 0.02 dB for 10 and 11 bits, respectively. In fact, this difference is not observable because of the flatness of the channel, which does not contribute to the larger dynamic range required to compensate for the fading of a real channel. Therefore, 11 bits are enough for the 802.11ac FFT input when realistic factors are considered.
A standalone simulation is applied to the FFT module only to provide wordlengths for the internal variables. The FFT input signal is assumed to be Gaussian distributed, with its power adjusted such that the AGC accommodates a 14 dB back-off. When the input is the only quantized variable, at 11 bits, the measured SQNR gives a reference SQNR of 57.07 dB. Following the algorithm shown in Fig. 3 [...] processor model is tested as an IFFT module by applying normalized 256-QAM modulated symbols at the input, the resulting SQNR is 56.36 dB. In this case, the overall loss is found to be 0.05 dB at BER = $10^{-5}$ when we simulate the integrated model involving both IFFT and FFT modules in fixed point.
Implementation
Based on the wordlengths obtained by the fixed point optimization, the architecture of the proposed FFT processor is modelled in Verilog and functionally verified using the ModelSim simulator. Test vectors have been generated by the reference fixed point MATLAB code and provided to the Verilog testbench as input files. The expected outputs are also captured into output files, which are loaded by the Verilog testbench to test the module functionality. The Verilog code has been revised until a match is obtained between the reference fixed point MATLAB code and the synthesizable Verilog code. The design is then synthesised onto [...] to produce the binary files for the testing board. The translated logic is placed and routed into the target device, where place-and-route simulations guarantee the processor behaviour when timing is involved.
Although it is unfair to compare different processors that have different features, it is useful to analyze each design from the perspective of another design. The idea is that our design supports the MU-MIMO feature. Other processors assume simultaneous stream processing with proper time alignment, which is not the case for the MU-MIMO IFFT. However, the software defined processor [11] and the Xilinx FFT core [23], which is used for streaming applications, can meet the timing requirements. Therefore, our design is evaluated with respect to these implementations. Table 2 lists the cost and performance for our design, the software approach (shown in parentheses), and the Xilinx design (shown in brackets). The operating clock frequency is the same as the sampling frequency since the pipelined processor achieves an issue rate of one, similar to the Xilinx processor. The software solution requires an operating frequency of 320 MHz to account for the 160 MHz channel. This processor employs time sharing between the processing elements to reduce the complexity at the expense of increasing the clock frequency.
Our design has the smallest utilization of DSP units. First, our design utilizes a higher radix system, which reduces the number of complex multiplications. Second, we have implemented each complex multiplier using only three real multipliers, at the cost of an increase in logic utilization to account for the extra additions. Third, we optimized the twiddle multiplication for the last radix-4 stage by using a simple shift-and-add methodology. Since other realizations depend on a radix-2 design, they require a large number of complex multiplications.
For the utilization of the logic elements, our design saves 6% of the corresponding [...]

It is clear from the above comparison that our design offers a good compromise between area, speed, and power. However, to provide more clarity about our design and to eliminate any confusion, we have compared the presented work to the most recent designs for ASIC applications, such as the work presented in [15]. First, we would like to emphasize that the number of memory instances in our design is large due to the separation of the input and output buffers at every stage. Although this is not a critical issue for FPGA designs, it requires a second look if an ASIC is the target. One option is to concatenate the real and imaginary data into a single memory instance by providing a wider bus. Another limitation could be the shuffling network, which consumes a large amount of wiring. This typically causes a routing issue at the back end. However, dividing the shuffling network into layers may be a good solution.
Conclusion
A high-throughput configurable 64/128/256/512-point IFFT/FFT processor [...] comparison with other implementations is also considered.
