This work describes the design, implementation, and performance evaluation of an orthogonal frequency division multiple access (OFDMA) time-division duplexing (TDD) physical layer (PHY) compliant with the worldwide interoperability for microwave access (WiMAX) standard using a cost-effective software-defined radio (SDR) platform containing field programmable gate array (FPGA) and digital signal processor (DSP) modules. We show that the proposed SDR architecture is capable of supporting the wide variety of configuration options described in the WiMAX standard while fulfilling the stringent requirements of WiMAX OFDMA TDD PHYs. The architecture allows for the implementation of all TDD functionalities in the downlink and the uplink at both the base station and the mobile station. The proposed design is shown to efficiently use the available FPGA and DSP resources. We also carried out specific experiments on the detection of frames and downlink map messages over ITU-R wireless channel models to illustrate the performance of the proposed design. Finally, we discuss the utilization of the proposed hardware architecture to implement the wirelessMAN-advanced air interface.
802.16 historical evolution up to 2010, see [2] and references therein. The most important 802.16 amendments are 802.16d, released in 2004 for point-to-point applications and commonly known as fixed WiMAX, and 802.16e released in 2005 and referred to as mobile WiMAX because it supports mobility and multiple users.
In 2011, the WiMAX standard evolved to amendment 802.16m [3, 4], which focuses on enhancements related to air interface specifications to fulfill the requirements and performance goals established by IMT-advanced while maintaining full backward compatibility with previous WiMAX versions. In August 2012, the latest revision of WiMAX was published and termed 802.16-2012 [5]. This revision consolidates material from amendments 802.16j-2009 and 802.16h-2010 and also incorporates 802.16m-2011 but excluding the wirelessMAN-advanced air interface, which is now specified in the IEEE Std 802.16.1-2012 [6]. The latest amendments to the standard are 802.16p-2012 [7] and 802.16.1b-2012 [8], which incorporate improvements to support machine-to-machine applications.
WiMAX supports several physical layer (PHY) modes. In particular, the most attractive PHYs are those that for 802.11n and 802.16e [22] , and a 802.16m and LTE downlink implementation [23] .
This work differs from existing ones in the literature because it presents a hardware architecture for the implementation of both the downlink and the uplink of OFDMA-TDD PHY for WiMAX applications. We discuss a large number of practical issues and show how they can be solved to fit into the proposed hardware architecture. Although most of the work focuses on mobile WiMAX, we also explain how the proposed architecture can be used to implement the recently standardized wirelessMAN-advanced air interface.
The remainder of this article is organized as follows. Section 2 provides a brief description of the OFDMA-TDD mobile WiMAX PHY. Section 3 describes the proposed hardware architecture for the implementation of an OFDMA-TDD PHY compliant with the mobile-WiMAX standard. Section 4 presents the amount of FPGA and DSP resources consumed by an implementation made with Xilinx system generator while Section 5 is devoted to its experimental evaluation over ITU-R wireless channel models. Section 6 explains how the proposed hardware architecture can be used to implement the PHY of the WirelessMAN-Advanced Air Interface. Finally, Section 7 presents the concluding remarks.
Mobile WiMAX physical layer
This section describes the primary features of the mobile WiMAX PHY to be used in the ensuing sections. For a more detailed description, see [24] .
Mobile WiMAX is the first of the WiMAX standards to use an OFDMA-TDD PHY to support several users at the same time. Among all IEEE 802.16e profiles, mobile WiMAX selected a subset of five whose fast Fourier transform (FFT) size, bandwidth, and sampling frequency values are shown in Table 1. Figure 1 shows the basic building blocks of an IEEE 802.16e transmitter. In the TDD mode, each frame contains one downlink subframe and one uplink subframe, and both use OFDM modulation. The downlink subframe is preceded by a preamble symbol whose subcarriers are chosen from a predefined set. The first two symbols after the preamble are reserved to send the frame control header (FCH) and downlink map (DL-MAP) messages, which describe the mapping of the bursts inside the downlink subframe. If an uplink map (UL-MAP) message is sent to set the uplink configuration, it should be transmitted on the first burst defined in the DL-MAP. The mapping of bursts into subframes can be done using different permutation schemes such as partial usage of subcarriers (PUSC) and full usage of subcarriers (FUSC). The slot unit is the minimum possible data allocation unit, and it is used to specify the data time-frequency regions of the bursts. Depending on the permutation scheme used, a slot is defined in a different way although it always encompasses 48 data subcarriers.
Uplink resources are shared among mobile stations (MSs), and their allocation and scheduling are centralized on the BS. The latter decides how many slots are assigned to each MS depending on their QoS parameters and bandwidth requirements. Additionally, rectangular time-frequency-shaped regions can be defined in the uplink to allow MSs to perform network entry, improve uplink synchronization parameters, or send special feedback messages, among other tasks.
Data and pilot carriers transmitted in either the uplink or the downlink go through a process of scrambling just before the inverse fast Fourier transform (IFFT) operation, and then a cyclic prefix (CP) is added to its output. The size of this CP is defined as a ratio of the FFT size. The valid values are 1/4, 1/8, 1/16, and 1/32, although the WiMAX Forum only requires support for the 1/8 value.
The channel coding procedure has five steps: randomization, forward error correction (FEC), bit interleaving, repetition coding, and modulation. Variable coding rate and modulation are supported to enable adaptive modulation and coding (AMC) capabilities.
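As a worked illustration of how the burst profile determines the capacity of a slot, the following sketch computes the information bits carried by the 48 data subcarriers of one slot. The function names and the treatment of repetition coding as a plain divisor are assumptions made for this example; the 28-byte message size is the DL-MAP size used later in the experiments.

```python
from fractions import Fraction
from math import ceil

# A slot always encompasses 48 data subcarriers, whatever the permutation scheme.
DATA_SUBCARRIERS_PER_SLOT = 48

# Bits per constellation point for the modulations considered in this work.
BITS_PER_SYMBOL = {"QPSK": 2, "16-QAM": 4, "64-QAM": 6}


def slot_payload_bits(modulation, code_rate, repetition=1):
    """Information bits in one slot (hypothetical helper, not from the paper)."""
    coded_bits = DATA_SUBCARRIERS_PER_SLOT * BITS_PER_SYMBOL[modulation]
    return Fraction(coded_bits) * Fraction(code_rate) / repetition


def slots_needed(payload_bytes, modulation, code_rate, repetition=1):
    """Smallest number of slots able to carry payload_bytes."""
    per_slot = slot_payload_bits(modulation, code_rate, repetition)
    return ceil(Fraction(payload_bytes * 8) / per_slot)


if __name__ == "__main__":
    print(slot_payload_bits("QPSK", Fraction(1, 2)))    # bits per slot
    print(slot_payload_bits("64-QAM", Fraction(3, 4)))  # bits per slot
    print(slots_needed(28, "QPSK", Fraction(1, 2)))     # slots for 28 bytes
```

Exact fractions are used so that rates such as 3/4 do not introduce rounding errors before the final ceiling operation.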
Mobile WiMAX physical layer design and implementation
This section describes the design and implementation of an OFDMA-TDD PHY compliant with the mobile WiMAX standard. We focus on the mandatory parts of the standard for both the BS and the MS, i.e., OFDMA frame structure, PUSC permutation scheme in downlink and uplink subframes, ranging, and channel coding with tail-biting convolutional codes (TBCC). Figure 2 plots the block diagram of the hardware architecture set for each terminal station and shows the location of each system task and the connections between them. The arrow numbers indicate the number of bits of the communications between tasks, e.g., 16 × 2 represents complex numbers with 16 bits of precision for each component.
Hardware description
Both BS and MS were implemented using the same hardware elements, namely three commercial off-the-shelf (COTS) modules placed on a peripheral component interconnect (PCI) carrier board, as shown in Figure 2. The first module contains a Texas Instruments TMS320C6416 DSP together with a Xilinx Virtex-II XC2V2000 FPGA. The second module is an FPGA Xilinx Virtex-4 XC4VSX55, and the third module contains an FPGA Xilinx Virtex-4 XC4VSX35 and an analog add-on module with two digital-to-analog converters (DACs) and two analog-to-digital converters (ADCs). The DACs are Texas Instruments DAC5686 [25], with 16 bits of precision and a maximum sampling rate of 160 Msample/s. The ADCs are Texas Instruments ADS5500 [26], with 14 bits of precision and a maximum sampling rate of 125 Msample/s. Both Xilinx Virtex-4 XC4VSX55 and Xilinx Virtex-4 XC4VSX35 FPGAs are provided with a large number of embedded multipliers allowing for intensive signal processing operations. Two kinds of buses were used for communicating between modules: data buses and control buses. The latter are exclusively used for configuration messages. The throughput of the data and control buses is 400 and 20 MB/s, respectively. The communication between the host PCs and their corresponding carrier boards is done through the PCI bus.
It is important to mention that all calculations in our implementation are done in fixed point with 16 bits of precision since there was no need to use fewer bits. On the one hand, no saving is obtained in the DSPs if fewer bits are used, and on the other hand, our design already fitted into the FPGAs when doing calculations with 16 bits of precision.
Digital up/downconversion
The digital up converter (DUC) and the digital down converter (DDC) are responsible for adapting the signal to the ADCs and DACs sampling rate and I/Q modulation/demodulation. During upconversion, the following tasks are done: upsampling, pulse shaping, and I/Q modulation to a configurable intermediate frequency. The downconverter performs the complementary operations in inverse order, i.e., I/Q demodulation, filtering, and downsampling.
In the proposed OFDMA-TDD WiMAX PHY layer design, the profiles selected by the WiMAX Forum are supported by means of five different bit streams to the FPGAs, each one with a different up/downsampling factor. The converters sampling frequency is fixed at 80 MHz. Hence, the up/downsampling factors to obtain profiles from #1 to #5 are 20, 100/7, 10, 8, and 50/7, respectively. In order to efficiently implement these sample-rate conversions, each FPGA bit stream has a different optimized combination of interpolation/decimation filters as explained in [27] . http://jwcn.eurasipjournals.com/content/2013/1/243
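The relation between the fixed converter rate and the five up/downsampling factors can be checked with a few lines of exact arithmetic. This is only a sketch: the baseband sampling frequencies are derived here from the 80-MHz converter rate and the factors quoted above, and the function names are hypothetical.

```python
from fractions import Fraction

# Converter sampling frequency stated in the text, in MHz.
CONVERTER_RATE_MHZ = Fraction(80)

# Up/downsampling factors for profiles #1 to #5, as listed in the text.
FACTORS = {
    1: Fraction(20),
    2: Fraction(100, 7),
    3: Fraction(10),
    4: Fraction(8),
    5: Fraction(50, 7),
}


def baseband_rate_mhz(profile):
    """Baseband sampling frequency implied by the factor of a profile."""
    return CONVERTER_RATE_MHZ / FACTORS[profile]


def rational_ratio(profile):
    """Factor expressed as (interpolate by L, decimate by M) in the DUC."""
    factor = FACTORS[profile]
    return factor.numerator, factor.denominator


if __name__ == "__main__":
    for profile in sorted(FACTORS):
        up, down = rational_ratio(profile)
        rate = float(baseband_rate_mhz(profile))
        print(f"profile #{profile}: L/M = {up}/{down}, baseband = {rate} MHz")
```

Profiles #2 and #5 are the only ones with a non-integer factor, which is one reason why a dedicated combination of interpolation/decimation filters per bit stream is convenient.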
Downlink synchronization
Frame and symbol detection are key operations to be performed at the MS. In the proposed design, frame and symbol detection are carried out using the correlation properties of the preamble and the WiMAX OFDM symbols, respectively. Figure 3 plots the block diagram of the synchronization subsystem implemented in the MS.
Since ADCs are not equipped with a programmable gain amplifier (PGA), normalization of the received signal is performed after the DDC stage. This is done by first computing the average power of the received signal and then applying the resulting value as a constant scale factor during the whole downlink subframe after synchronization. This normalization strategy has been selected because it provides a good compromise between clipping and quantization errors. The frame detection time is also fed to an uplink transmission control block which schedules the emission of the uplink subframe taking into account the subframes size and the transmit/receive transition gap (TTG) and receive/transmit transition gap (RTG) guard intervals.
The energy estimations computed during the first 1,024 samples after the preamble and during the RTG guard interval are stored in a configuration register to allow for their reading from the DSP. These values are eventually used to estimate the signal-to-noise ratio (SNR).
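One plausible way of turning the two stored energy values into an SNR figure is sketched below. It assumes that the samples after the preamble contain signal plus noise and that the RTG samples contain noise only; the text does not spell out the exact formula used in the DSP, so both the formula and the function names are assumptions.

```python
import math


def mean_energy(samples):
    """Average energy per sample of a block of complex baseband samples."""
    return sum(abs(s) ** 2 for s in samples) / len(samples)


def estimate_snr_db(energy_signal_plus_noise, energy_noise):
    """SNR estimate in dB from the two energy measurements.

    Assumption of this sketch: the first measurement holds signal plus
    noise and the second one (taken during the RTG gap) holds noise only.
    """
    if energy_noise <= 0:
        raise ValueError("noise energy must be positive")
    signal = energy_signal_plus_noise - energy_noise
    if signal <= 0:
        return float("-inf")  # no usable signal energy was measured
    return 10 * math.log10(signal / energy_noise)


if __name__ == "__main__":
    # Toy values: 11 units of energy in the symbol, 1 unit in the gap.
    print(estimate_snr_db(11.0, 1.0))
```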
Preambles in mobile WiMAX have a fixed structure with two guard subcarriers inserted between each pilot subcarrier whose values are chosen from a predefined set depending on the segment and the BS cell identifier. This structure results in a threefold repetition of samples in the time domain that can be exploited to detect the beginning of a new frame through the following repetition property-based (RPB) autocorrelation metric [28] 
where r(n) is the complex-valued baseband received signal and N is the FFT size. When the preamble of a downlink frame is received, this metric reaches its maximum value and keeps this value during a plateau. The presence of this plateau indicates the arrival of a new downlink frame. The particular sample at which the FFT window starts can be determined by means of two additional metrics. The first one is the CP autocorrelation metric defined as
where G is the length of the CP. The second metric is the quantized cross-correlation (QC) that calculates a cross-correlation between the quantized received signal, $\tilde{r}(n)$, and the last 64 quantized preamble samples in the time domain, $\tilde{p}(k)$, i.e.,
Quantization consists of mapping the input signal and the preamble into −1, 0, and 1 values to avoid the use of complex multipliers and reduce correlation calculation complexity. The previously defined three correlation metrics are combined together to determine the frame starting time, $\hat{\theta}$, as follows
Notice that since the received input signal is normalized, the maximum of this function can be easily determined as the sample time at which this function exceeds a predefined threshold value. The autocorrelation metrics $R_{\mathrm{RPB}}(k)$ and $R_{\mathrm{CP}}(k)$ can also be used to obtain estimates of the frequency offset. Indeed, two frequency offset estimations, $\hat{\varphi}_{\mathrm{RPB}}$ and $\hat{\varphi}_{\mathrm{CP}}$, can be obtained by normalizing the phase of the autocorrelation metrics at the frame starting time, $\hat{\theta}$, with respect to the subcarrier spacing [29], i.e.,
, and
These two values can be successfully combined to enhance the accuracy of the frequency offset estimate. The preamble autocorrelation frequency offset estimate $\hat{\varphi}_{\mathrm{RPB}}$ provides a frequency offset window in which the exact value can be determined from the CP autocorrelation frequency offset estimate, $\hat{\varphi}_{\mathrm{CP}}$. In a general form,
where $\hat{\varphi}_{\mathrm{COMB}}$ is the combined frequency offset estimation. In the above expression, the frequency offset estimation range of $\hat{\varphi}_{\mathrm{CP}}$ is limited. When $\hat{\varphi}_{\mathrm{CP}}$ is out of its range, its value should be adjusted by adding or subtracting multiples of $1/N$ until it matches the value obtained with the $\hat{\varphi}_{\mathrm{RPB}}$ metric.
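The combination of a coarse and a fine frequency offset estimate can be illustrated with the following sketch. It uses a textbook form of the CP autocorrelation (the cyclic prefix correlated with its copy N samples later) and expresses offsets in cycles per sample, so that the CP-based estimate is ambiguous in multiples of 1/N. The exact metric expressions and normalizations of the design are not reproduced here; the function names and the way the coarse estimate is supplied are assumptions.

```python
import cmath
import math


def cp_autocorrelation(r, k, n_fft, cp_len):
    """Correlate the cp_len samples starting at k with those n_fft later."""
    return sum(r[k + n].conjugate() * r[k + n + n_fft] for n in range(cp_len))


def offset_from_phase(corr, lag):
    """Frequency offset (cycles per sample) from a lag-`lag` correlation."""
    return cmath.phase(corr) / (2 * math.pi * lag)


def combine_offsets(phi_coarse, phi_cp, n_fft):
    """Shift the CP estimate by the multiple of 1/N closest to the coarse one."""
    m = round((phi_coarse - phi_cp) * n_fft)
    return phi_cp + m / n_fft


if __name__ == "__main__":
    n_fft, cp_len = 64, 8
    body = [complex(math.cos(1.7 * n), math.sin(0.9 * n)) for n in range(n_fft)]
    symbol = body[-cp_len:] + body            # OFDM symbol with cyclic prefix
    true_offset = 1.3 / n_fft                 # 1.3 subcarrier spacings
    r = [s * cmath.exp(2j * math.pi * true_offset * n)
         for n, s in enumerate(symbol)]
    phi_cp = offset_from_phase(cp_autocorrelation(r, 0, n_fft, cp_len), n_fft)
    phi_coarse = 1.2 / n_fft                  # stand-in for the RPB estimate
    print(phi_cp * n_fft, combine_offsets(phi_coarse, phi_cp, n_fft) * n_fft)
```

In this toy case the CP estimate alone only recovers the fractional part of the offset, and the coarse value selects the correct multiple of 1/N.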
Ranging and uplink synchronization
In multiuser mobile environments, time and frequency estimations obtained at MSs cannot be directly used to construct the uplink signal because the relative distance and speed with respect to the BS are not known [30] . In the IEEE 802.16e standard, this problem is solved with the so-called ranging process. In such a process, MSs transmit pseudonoise (PN) sequences generated from a shift register in specific regions of the uplink reserved for this purpose by the BS in a contention-based policy. At the receiver side, the BS must detect the arrival of a ranging code and estimate the synchronization parameters from it. Finally, these parameters are sent back to the MSs in a medium access control (MAC) management message and used to construct the synchronized uplink frames to be transmitted by the MSs.
Two types of ranging regions are defined: initial ranging, used during network entry, and periodic ranging, used when the MSs are already connected. In the case of initial ranging, OFDM symbols containing ranging codes must be transmitted by MSs in pairs, the first symbol with a CP and the second one with a cyclic postfix, hence allowing a wider time synchronization window. In our implementation, the mobile station has a special version of the IFFT block which can receive as a parameter the pattern of cyclic prefixes and postfixes of the sent symbols to accomplish this requirement.
Ranging codes, $p_c(k)$, are sequences of 144 BPSK symbols generated from the output of a pseudorandom binary sequence (PRBS). Different sets of codes are used depending on the purpose of the MS: initial ranging, periodic ranging, bandwidth requests, or handover. When an MS decides to start a ranging process, it selects a code randomly from the corresponding set and then maps it to a ranging region. This mapping in a PUSC zone is done in a distributed fashion, and only groups of four symbols are guaranteed to be transmitted in contiguous subcarriers. The BS must identify the ranging code sent by the MS in order to estimate the uplink synchronization errors.
Code detection and time offset estimation in the BS are done in the frequency domain over each OFDM symbol. Let $X(k)$ represent the 144 BPSK received symbols in a ranging subchannel of a single OFDM symbol. An energy threshold is first applied to them to avoid unnecessary further processing [31]. When the energy threshold is reached, a cross-correlation of the received symbols, $X(k)$, with all possible ranging codes, $p_c(k)$, is used to determine the ranging code index $c$. If we denote $t_c(k) = X(k)p_c(k)$, the product of the received symbols times the $c$th ranging code, we can write the lag-one autocorrelation of $t_c(k)$ for groups of four consecutive subcarriers as follows [32]:
where $T$ is the number of tiles of a ranging code. If we assume that the channel coefficients are similar in adjacent subcarriers, the effect of the channel is canceled in $R(c)$ and only the residual time offset remains. This way, we can define estimators for the ranging code and the time offset of the uplink signal as follows:
In the literature, several uplink frequency offset estimation algorithms can be found. These algorithms can be divided into three groups, from lower to higher computational complexity: subband, interleaved, and generalized allocation of subcarriers. Ranging in mobile WiMAX is an example of generalized allocation where the subcarriers reserved to the ranging process can take up any position in the available spectrum. The algorithms defined for this kind of structures are based on a joint maximum likelihood (JML) estimation of the channel response and the frequency offset but with a very high complexity [30]. Notice that the uplink synchronization algorithms selected for our design avoid the complexity of JML algorithms by exploiting the redundancy present in the ranging codes.
Once the ranging code is known, frequency offset can be extracted through reconstruction of the transmitted signal sent by the mobile station. To do so, the received PN sequence is mapped back to the OFDM symbol. Since the initial ranging forces mobile stations to transmit the same ranging code twice in two consecutive symbols, this property can be used to extract the frequency offset through a correlation computation.
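The ranging code detection and time offset estimation described earlier in this subsection can be sketched as follows. The received symbols are multiplied by each candidate code, a lag-one autocorrelation is accumulated inside each group of four contiguous subcarriers, the code with the largest metric magnitude is selected, and the phase of that metric is converted into a time offset in samples. The sign convention (a delay of τ samples producing a phase ramp exp(−j2πkτ/N) across subcarriers), the absence of any normalization, and all names are assumptions of this example rather than the exact estimators of the design.

```python
import cmath
import math


def lag_one_metric(x_tiles, code_tiles):
    """Sum of t(k) * conj(t(k+1)) inside each tile, with t = X * p_c."""
    acc = 0j
    for x_tile, p_tile in zip(x_tiles, code_tiles):
        t = [x * p for x, p in zip(x_tile, p_tile)]
        acc += sum(t[i] * t[i + 1].conjugate() for i in range(len(t) - 1))
    return acc


def detect_code_and_offset(x_tiles, codebook, n_fft):
    """Return (index of the detected code, time offset in samples)."""
    metrics = [lag_one_metric(x_tiles, code) for code in codebook]
    best = max(range(len(codebook)), key=lambda c: abs(metrics[c]))
    tau = n_fft * cmath.phase(metrics[best]) / (2 * math.pi)
    return best, tau


if __name__ == "__main__":
    # Toy codebook: 3 codes of 2 tiles x 4 BPSK symbols (real codes have 144).
    codebook = [
        [[1, 1, 1, 1], [1, 1, 1, 1]],
        [[1, -1, 1, -1], [1, -1, 1, -1]],
        [[1, 1, -1, -1], [1, -1, -1, 1]],
    ]
    n_fft, tau_true, starts = 1024, 5.0, [0, 100]
    x = [[codebook[2][t][i]
          * cmath.exp(-2j * math.pi * (starts[t] + i) * tau_true / n_fft)
          for i in range(4)] for t in range(2)]
    print(detect_code_and_offset(x, codebook, n_fft))
```

Because only products of adjacent subcarriers within a tile are used, the sketch does not need to know where each tile sits in the spectrum, which matches the distributed mapping of ranging codes in a PUSC zone.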
Subchannelization and channel equalization
Tasks related with the OFDM modulation are placed in the Virtex-4 SX55 FPGA module. The most important operation is the FFT, which has been implemented using the Xilinx LogiCORE IP fast Fourier transform [33] , allowing for run-time configuration of the transform point size.
Subchannelization in WiMAX involves three operations: interleaving, randomization of subcarriers according to some permutation scheme, and pilot insertion. This structure is specified in the DL-MAP and UL-MAP messages sent by the BS in each frame. As described in Section 2, the DL-MAP message is always mapped on the first two symbols of the downlink subframe, hence providing a complete description of the permutation schemes used and bursts contained inside the subframe. At the receiver, decoding the DL-MAP messages proved to be a critical task since most of the processing of the downlink subframe cannot start until this message is completely decoded. On the other hand, the randomization of subcarriers in the uplink cannot be applied to the ranging bursts. As a consequence, this process depends entirely on the uplink burst scheme defined by the BS.
Taking these issues into account, we decided to implement the subchannelization and channel equalization processes in the DSPs to provide maximum flexibility regarding FFT sizes, burst mapping, and eventual support of other permutation schemes. In the MS, the extraction of DL-MAP messages is optimized through the different design layers to minimize the delay of the decoding pipeline rather than implementing a hardware low-level MAC for this purpose [19] .
The selected channel estimation and equalization algorithms are piecewise linear channel coefficients interpolation and zero forcing, respectively. Several analyses of channel estimation and equalization algorithms for WiMAX can be found in the literature showing that the selected method offers an acceptable performance in terms of mean squared error (MSE) and bit error rate (BER) with a low complexity implementation [34, 35]. In the downlink, each symbol is equalized independently in the frequency dimension, and in the uplink, all pilot subcarriers in a tile, made up of four subcarriers during three OFDM symbols, are used together to perform this task with a two-dimensional interpolation.
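For the downlink case, where each symbol is equalized independently along the frequency dimension, the two selected algorithms can be sketched in a few lines. Pilot positions and values are passed in as plain lists, and subcarriers outside the outermost pilots are handled by extending the nearest segment; both choices, like the function names, are assumptions of this sketch and not details taken from the implementation.

```python
from bisect import bisect_right


def pilot_estimates(received, pilot_idx, pilot_values):
    """Least-squares channel estimates at the pilot subcarriers."""
    return [received[k] / p for k, p in zip(pilot_idx, pilot_values)]


def interpolate_channel(pilot_idx, pilot_est, n_sub):
    """Piecewise linear interpolation of complex coefficients (>= 2 pilots)."""
    h = []
    for k in range(n_sub):
        j = bisect_right(pilot_idx, k) - 1        # last pilot at or below k
        j = min(max(j, 0), len(pilot_idx) - 2)    # keep segment (j, j+1) valid
        k0, k1 = pilot_idx[j], pilot_idx[j + 1]
        w = (k - k0) / (k1 - k0)
        h.append((1 - w) * pilot_est[j] + w * pilot_est[j + 1])
    return h


def zero_forcing(received, h):
    """Zero-forcing equalization: divide each subcarrier by its coefficient."""
    return [y / hk for y, hk in zip(received, h)]


if __name__ == "__main__":
    n_sub, pilots = 9, [0, 4, 8]
    h_true = [(1 + 0.5j) + k * (0.1 - 0.05j) for k in range(n_sub)]
    tx = [1 + 0j if k in pilots else 1 - 1j for k in range(n_sub)]
    rx = [h * x for h, x in zip(h_true, tx)]
    est = pilot_estimates(rx, pilots, [1 + 0j] * len(pilots))
    print(zero_forcing(rx, interpolate_channel(pilots, est, n_sub)))
```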
Channel coding
Information bits received from higher layers are mapped into constellation points after a channel coding process that includes randomization and bit interleaving. Additionally, the repetition coding step is performed over the constellation-mapped data in a slot-by-slot manner. In the proposed design, channel coding is mainly implemented in the Virtex-II FPGA, although the optional repetition coding step and the processing control are carried out in the DSP, using the FPGA as a coprocessor. In this work, we focus on the TBCC coding scheme with variable rate and QPSK, 16-QAM, and 64-QAM constellations, both in the downlink and in the uplink.
The encoder in a tail-biting scheme has a complexity similar to that of a zero-tail encoder. The encoder was implemented adding a CP to each FEC block with a size equal to the constraint length of the shift register (in the case of mobile WiMAX, this value is seven). The decoder has a higher complexity because the starting state of the trellis is unknown before decoding. Maximum likelihood (ML) decoding achieves optimum performance, but it requires decoding the received block starting with all the possible initial states, which increases decoding complexity to unacceptable levels [36] . The implemented channel decoding process uses a suboptimal technique which provides a good compromise between decoding quality and complexity, where the first bits of the block are appended after the block, and the last bits at the beginning of the block [37] . The size of the chunks added at the beginning and at the end of the blocks is equal to the traceback length configured in the Viterbi decoder. If a block is shorter than the traceback length, it is just sent three times to the decoder and only the output corresponding to the second repetition is taken into account.
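The block wrapping used by the suboptimal tail-biting decoding technique can be sketched as a small helper that prepares the input of a conventional Viterbi decoder and indicates which part of its output to keep. The block is modeled here as a list with one entry per trellis step (for example, the group of soft bits of that step), and the Viterbi decoder itself is outside the scope of the sketch; the function name and the list-based interface are assumptions.

```python
def wrap_block(steps, traceback):
    """Extend a tail-biting block for a standard Viterbi decoder.

    Returns (extended_block, keep), where keep is the slice of decoder
    outputs (one per trellis step) that corresponds to the original block.
    """
    if traceback < 1:
        raise ValueError("traceback length must be at least 1")
    n = len(steps)
    if n < traceback:
        # Short block: send it three times and keep the second repetition.
        return steps * 3, slice(n, 2 * n)
    # Last steps go before the block, first steps go after it.
    extended = steps[-traceback:] + steps + steps[:traceback]
    return extended, slice(traceback, traceback + n)


if __name__ == "__main__":
    block = list(range(10))            # stand-in for 10 trellis steps
    extended, keep = wrap_block(block, traceback=3)
    print(extended)
    print(extended[keep] == block)     # an ideal decoder would return these
```

The chunk added before the block lets the decoder converge to the correct state before the first useful step, and the chunk added after it provides the traceback depth needed for the last useful steps.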
Additionally, the decoder performs a carrier-to-interference and noise ratio (CINR) estimation based on the demodulated data symbols by computing an error vector magnitude (EVM) measurement. This estimation was implemented in the soft decisor by mapping the soft bits back to symbols, hence obtaining a reliable estimation of the transmitted symbols. Then, the MSE between the received signal and the estimated transmitted symbols is calculated and saved in a register so that the DSP can read it. This algorithm provides an accurate estimation of the CINR as long as decision errors are kept at low levels. If this is not the case, an overestimation of the CINR will occur.
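A decision-directed CINR estimate of this kind can be sketched as follows. For simplicity, the sketch takes hard decisions to the nearest point of a unit-power QPSK constellation instead of mapping soft bits back to symbols, so it is a simplified stand-in for the soft decisor described above; the names are hypothetical.

```python
import math

# Unit-average-power QPSK constellation used for the illustration.
QPSK = [complex(a, b) / math.sqrt(2) for a in (-1, 1) for b in (-1, 1)]


def nearest_point(symbol, constellation):
    """Hard decision: closest constellation point to the received symbol."""
    return min(constellation, key=lambda c: abs(symbol - c))


def estimate_cinr_db(received, constellation=QPSK):
    """CINR estimate from the MSE between received and decided symbols."""
    decided = [nearest_point(s, constellation) for s in received]
    mse = sum(abs(r - d) ** 2 for r, d in zip(received, decided)) / len(received)
    power = sum(abs(d) ** 2 for d in decided) / len(decided)
    if mse == 0:
        return float("inf")
    return 10 * math.log10(power / mse)


if __name__ == "__main__":
    # Every symbol is displaced by 0.1, i.e., an error power of 0.01.
    print(estimate_cinr_db([p + 0.1 for p in QPSK]))
```

When a decision is wrong, the measured error is the distance to the wrongly chosen point, which is smaller than the true error; this is the mechanism behind the CINR overestimation at high error rates.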
Physical layer control
The subframes structure is controlled from the higher layers in the BS using a service access point (SAP) protocol and is sent to the MS through MAC management messages (DL-MAP, UL-MAP, downlink channel descriptor (DCD), and uplink channel descriptor (UCD)). This SAP allows for defining the subframes structure, for sending and receiving data bursts, and for transmitting and detecting ranging codes.
The downlink subframe must follow some constraints regarding the permutation zone and burst definitions. First of all, bursts must be time-frequency rectangular-shaped and should always span a multiple of two symbols in time and a multiple of a subchannel size in frequency (this is the so-called slot unit according to WiMAX terminology). Moreover, several users can be grouped into a single burst to reduce overhead in the DL-MAP definition and to speed up the generation of bursts. Finally, the BS has to distribute the available resources between users taking into account their QoS parameters.
There are several solutions to address these problems [39], but in our implementation, the Ohseki algorithm [40] was chosen because of its good compromise between computational complexity and allocation losses. The general idea of this algorithm is to assign all users with equal burst profile to the same burst and to allocate its resources in a frequency-first policy, hence avoiding any burst overlapping in the frequency domain.
Resource management in the uplink is more flexible since it is only necessary to indicate the number of slots allocated to each station with no constraints regarding the time-frequency burst-shape. The allocation size is decided by the MAC layer taking into account the QoS parameters negotiated for connections and the bandwidth requirements sent by the MSs as signaling headers in the uplink.
Resource utilization
The hardware architecture described in the previous section contains three FPGAs and a single DSP per station. FPGAs are the most critical parts, and their size should be large enough to enable the implementation of the tasks assigned to them. Table 2 shows the FPGA resource utilization in terms of slices, LUTs, RAMB16s, and multipliers after the implementation of the previously described OFDMA-TDD mobile WiMAX PHY. FPGA designs were implemented using Xilinx system generator 10.1 and built with Xilinx ISE 10.1. Power consumption estimations of each module were obtained using the Xilinx XPower Analyzer tool, and they are also included in Table 2 . Thanks to the design decisions adopted in the previous section, we were able to successfully implement the whole OFDMA-TDD PHY at both the BS and the MS.
The FPGA resource allocation shown in Table 2 considers separately the cases of the BS and the MS. The main difference between both designs lies in the synchronization block in the MS, which requires 58% of the slices of the Virtex-4 SX35. The quantized cross-correlation algorithm is the most demanding block inside this synchronization module. Another difference is caused by the ability of the MS to add cyclic postfixes to the output of the IFFT, which is required for sending the initial ranging codes. Table 3 shows an estimation of the individual FPGA resource utilization of each block obtained when compiling them separately. Notice that these results were obtained before the compiler applied its global optimizations to the design. Additionally, the operation frequency of each block is also shown as well as the critical path delay of each module. The internal FIFO blocks shown in Table 3 are used to support the communications between the different modules. Also, the TBCC encoder is subdivided into the symbol mapper and the FEC TX blocks, while the TBCC decoder is made up of the soft decisor and the FEC RX blocks.
The Virtex-4 SX55 is a high-resource FPGA that allowed for the implementation of the FFT blocks without a resource-optimized design, hence a pipelined architecture was used allowing for continuous data processing. However, the Virtex-II V2000 is resource limited, which forced us to optimize the FEC design.
Regarding DSP resources, Table 4 shows the memory usage of each task and an estimation of the DSP cycles required for the processing performed inside each task.
The estimation of the DSP cycles is obtained from a static analysis of the assembly code generated by the compiler. We also present an estimation of the time required to execute each task in the last column of the table. These time estimations were obtained making the following assumptions:
• The 8.75-MHz profile is used with 1,024 subcarriers and a cyclic prefix length of 1/8.
• The frame duration is 5 ms, with 25 symbols in the downlink and 18 symbols in the uplink.
• The subframes are used entirely for data transmission.
• Data subcarriers are modulated in 64-QAM and convolutional coding with rate 3/4.
• Every 16 frames, there is a ranging burst of 30 subchannels and 3 symbols.
• The tasks which use the internal DSP memory are executed at 600 MHz, while the tasks that only use ZBTRAM memory are executed at 100 MHz.
• The data copy between the DSP tasks is performed at 800 MB/s.
• The communication with the FPGAs does not consume DSP time.
The estimation of the total DSP time used is 958.56 and 927.89 μs for the BS and MS, respectively. This is an optimistic estimation since it does not take into account the time consumed by the kernel, the context switches, or the interrupt handling. Furthermore, the delay of the communication with the FPGA and the interdependence between the processing tasks can lead to long waiting times for FPGA data. This means that good concurrent processing planning is also needed to fulfill the 5-ms frame duration.
Experimental results
In this section, we present the results of several tests that were conducted to check the performance of the proposed OFDMA-TDD WiMAX PHY implementation. In order to carry out the evaluation in a repeatable and reproducible way, we set up an evaluation system that uses a channel emulator that implements different time-varying channel models. To evaluate the DL, the DL signal crosses the channel emulator while the UL is directly connected with a cable. The reverse configuration (i.e., the DL uses a cable while the UL crosses the channel emulator) is used to evaluate the UL.
The channel emulator was implemented on a Xilinx Virtex-4 FPGA using the Xilinx XtremeDSP development kit. As shown in Figure 4 , it consists of a channel coefficient generator, an interpolator, a channel filtering stage, and an additive white Gaussian noise (AWGN) generator. It accepts parameters like the average power and delay of each tap, the noise power, and the intermediate frequency of the input signal. The coefficient interpolation factor as well as the Doppler power spectrum are defined at compilation time, and they are fixed during the emulation.
The channel emulator was configured to implement the ITU-R M.1225 models [41] . Following the recommendations of the WiMAX Forum [42] , four models were considered: pedestrian A (3 km/h), pedestrian B (3 km/h), and vehicular A at 60 and at 120 km/h. A summary of the tapped delay line features of these channel models is shown in Table 5 .
All channel models use the Jakes Doppler power spectrum density, and a 2.4-GHz carrier frequency was assumed for the Doppler spread calculations. The maximum delay of these channels (3,700 ns) does not exceed in any case the default 1/8 CP length (11,429 ns); hence, intersymbol interference (ISI) is avoided. It is important to note that the pedestrian A scenario stands out because it has a low multipath diversity. Multipath diversity is an inherent property of wireless channels that occurs whenever the channel power delay profile is rich enough to provide replicas of the transmitted signal at the receiver input. In time-varying scenarios, the amplitude and the phase of such replicas change over time.
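The two figures that drive this paragraph, the CP duration and the Doppler spread, follow from simple arithmetic that can be reproduced as below. The FFT size and sampling frequency used in the example (1,024 points at 11.2 MHz) are assumptions: the sampling frequency is the one implied by the factor 50/7 at an 80-MHz converter rate, and this combination reproduces the quoted 11,429 ns. The maximum Doppler shift is computed with the usual v·f/c relation for the speeds and carrier frequency stated in the text.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0


def cp_duration_ns(n_fft, cp_ratio, sampling_hz):
    """Duration of the cyclic prefix in nanoseconds."""
    return n_fft * cp_ratio / sampling_hz * 1e9


def max_doppler_hz(speed_kmh, carrier_hz):
    """Maximum Doppler shift for a terminal moving at speed_kmh."""
    return speed_kmh / 3.6 * carrier_hz / SPEED_OF_LIGHT_M_S


if __name__ == "__main__":
    cp_ns = cp_duration_ns(1024, 1 / 8, 11.2e6)   # assumed profile parameters
    max_delay_ns = 3700                           # largest delay of the models
    print(round(cp_ns), cp_ns > max_delay_ns)
    for speed in (3, 60, 120):                    # km/h, as in the experiments
        print(speed, round(max_doppler_hz(speed, 2.4e9), 1))
```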
The pedestrian A channel model only contains four paths with the last two being rather attenuated. Furthermore, the path delay spread is rather small so the frequency selectivity of this channel is rather low, hence allowing for a good channel equalization. On the contrary, notice that the pedestrian B and vehicular A scenarios have higher multipath diversity and larger path delay spreads. As explained in Section 3, MSs estimate the SNR of the received signal using the values obtained during the synchronization process. These estimated SNR values were used to calibrate the AWGN generator, hence matching the noise power added in the emulator with the estimated SNR in the MS. This way, the SNR at the receiver is under control in all scenarios.
Table 5 ITU-R M.1225 channel models
First of all, experiments were carried out to evaluate the performance of the frame detection stage. Towards this aim, we transmitted at least $10^4$ frames in the downlink direction and counted in the MS the number of frames detected. The frame detection performance over AWGN and ITU-R channels is shown in Figure 5 with 90% confidence intervals for the mean computed using bootstrapping [43]. The best results are clearly obtained over the AWGN channel, with almost perfect detection at an SNR value of 0 dB. In the case of ITU-R channels, the performance degrades significantly due to channel fades. Similar results were obtained for pedestrian B and vehicular A channels where the SNR has to be increased up to 12 dB in order to achieve frame error rate (FER) values below $10^{-3}$. The worst results were obtained with the pedestrian A channel model because its multipath diversity is smaller than in the other channel models.
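A percentile bootstrap for the mean of per-frame detection flags can be sketched as follows. The text only states that 90% confidence intervals were computed using bootstrapping; the percentile variant, the number of resamples, and the fixed seed are assumptions of this example.

```python
import random


def bootstrap_ci_mean(samples, confidence=0.90, n_resamples=2000, seed=0):
    """Percentile bootstrap confidence interval for the mean of samples."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(samples, k=n)) / n
                   for _ in range(n_resamples))
    alpha = (1 - confidence) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Toy data: 1 = frame detected, 0 = frame missed.
    flags = [1] * 95 + [0] * 5
    print(sum(flags) / len(flags), bootstrap_ci_mean(flags))
```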
During the previous experiments, we also counted the number of DL-MAP messages correctly decoded, since this is the criterion used in the standard to decide whether downlink synchronization has been acquired. The DL-MAP messages were sent using QPSK, convolutional coding with rate 1/2 and no repetitions, and a size of 28 bytes including the header and the cyclic redundancy check (CRC). Figure 6 shows the DL-MAP decoding FER over AWGN and ITU-R channels. This error ratio counts as errors not only the frames in which the FCH or the DL-MAP are not correctly decoded but also the undetected frames. Comparing these results with the frame detection error rate in Figure 5, we conclude that frame detection does not limit system performance, since frame detection errors occur at SNR values about 5 dB lower than those at which DL-MAP decoding errors occur.
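The error ratio defined in this paragraph can be written down explicitly. The sketch below is only an illustration of the counting rule; the function name and the frame counts are hypothetical.

```python
def dlmap_fer(frames_sent, frames_detected, dlmap_decoded):
    """DL-MAP FER that counts undetected frames as errors.

    frames_detected: frames found by the frame detection stage.
    dlmap_decoded: detected frames whose FCH and DL-MAP were both decoded.
    """
    if frames_sent <= 0 or not 0 <= dlmap_decoded <= frames_detected <= frames_sent:
        raise ValueError("inconsistent frame counts")
    undetected = frames_sent - frames_detected
    decoding_failures = frames_detected - dlmap_decoded
    return (undetected + decoding_failures) / frames_sent


# Hypothetical counts for a single SNR point.
print(dlmap_fer(frames_sent=10_000, frames_detected=9_990, dlmap_decoded=9_950))
```

With this definition, the DL-MAP FER can never be lower than the frame detection error rate measured at the same SNR.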
Next, we evaluated the performance of the uplink timing offset synchronization module. The MS was configured to continuously send ranging codes in the uplink, and the time offset estimates computed at the BS were stored. The result of this test is shown in Figure 7, which plots the MSE of the time offset estimates (expressed in number of samples). The uplink timing offset implementation is insensitive to the features of the different channel models and provides acceptable estimates in all cases, even at low SNR values.
Regarding BER performance, downlink and uplink measurements were carried out over the AWGN and the ITU-R channels. In order to measure the BER, a fixed and known structure of subframes was used in the downlink and in the uplink. The purpose was to enable the measurement of the BER even when the FCH or the DL-MAP messages could not be decoded, although undetected frames are ignored. Figure 8 plots the coded BER with respect to the SNR for the 3.5-MHz downlink profile when transmitting over an AWGN channel. As expected, the curves move to the right as the spectral efficiency increases. Similar results were obtained for the other uplink and downlink profiles.
Figure 9 shows the results of the coded BER tests for the 8.75-MHz downlink profile considering the ITU-R channels. The pedestrian A results are consistent with those obtained for the DL-MAP FER, bearing in mind that undetected frames are ignored in the BER measurements, whereas the results in Figure 6 are also affected by frame detection. The lack of multipath diversity explains the poor performance of pedestrian A at low and medium SNR values. At higher SNR levels, however, the pedestrian A results improve and outperform the others because its channel frequency response is easier to equalize.
Figure 10 plots the coded BER for the uplink and the ITU-R channel models. These results are better than those shown in Figure 9 for the downlink, especially in the pedestrian B channel. This can be explained by the higher pilot density in the WiMAX uplink frame structure, which allows for better channel tracking. Regarding the vehicular A channel model, an error floor can be observed in both the downlink and the uplink because channel estimation has not been designed to compensate for the intercarrier interference (ICI) generated by the fast time variations of the channel. The ICI acts as a constant noise source on all subcarriers, which produces the error floor observed in the vehicular A channel.
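To give a feeling for the time variation that causes the ICI, the following minimal sketch computes the maximum Doppler shift and its ratio to the subcarrier spacing. The 2.4-GHz carrier is taken from the text; the 60-km/h speed and the 10-MHz sampling rate with a 1,024-point FFT for the 8.75-MHz profile are assumptions made for this example only.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0


def max_doppler_hz(speed_kmh, carrier_hz):
    """Maximum Doppler shift for a terminal moving at speed_kmh."""
    return (speed_kmh / 3.6) * carrier_hz / SPEED_OF_LIGHT_M_S


def subcarrier_spacing_hz(sampling_rate_hz, fft_size):
    """Frequency spacing between adjacent OFDM subcarriers."""
    return sampling_rate_hz / fft_size


# Assumed values: 60 km/h vehicular speed, 10-MHz sampling rate, 1,024-point FFT.
doppler = max_doppler_hz(60.0, 2.4e9)
spacing = subcarrier_spacing_hz(10e6, 1024)
normalized_doppler = doppler / spacing

print(f"{doppler:.1f} Hz, {spacing:.3f} Hz, {normalized_doppler:.4f}")
```

The larger the normalized Doppler, the more the channel changes within one OFDM symbol and the stronger the resulting ICI.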
Finally, Figures 11 and 12 show the FER over the ITU-R channel models for the downlink and the uplink, respectively. Unlike the previous FER measurements, these are not affected by the undetected frames. The measured burst in the downlink occupies 15 subchannels over 18 OFDM symbols, with a total of 6,480 data subcarriers per downlink subframe. In the uplink measurement, the burst occupies the complete subframe with 10,080 data subcarriers per uplink subframe. The downlink FER results are consistent with the downlink BER ones. The only significant difference is the worse performance in pedestrian B, caused by the higher frequency selectivity, which increases the probability of isolated errors in every burst. This results in a higher FER but does not significantly affect the BER measurement. The uplink FER results are also consistent with the BER measured in the uplink. In general, the results of pedestrian A are improved when the FER is measured since the erroneous bursts occur more often due to the lower multipath diversity.
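The burst sizes quoted above follow from the slot arithmetic of the subchannelization. The sketch below assumes PUSC slots carrying 48 data subcarriers, spanning two OFDM symbols in the downlink and three in the uplink; the 35 uplink subchannels and 18 uplink symbols are further assumptions, chosen because they reproduce the 10,080 figure.

```python
DATA_SUBCARRIERS_PER_SLOT = 48  # assumed PUSC slot capacity


def burst_data_subcarriers(subchannels, ofdm_symbols, symbols_per_slot):
    """Number of data subcarriers in a rectangular burst."""
    if ofdm_symbols % symbols_per_slot != 0:
        raise ValueError("burst duration must be a whole number of slots")
    slots = subchannels * (ofdm_symbols // symbols_per_slot)
    return slots * DATA_SUBCARRIERS_PER_SLOT


# Downlink burst from the text: 15 subchannels over 18 OFDM symbols.
downlink = burst_data_subcarriers(15, 18, symbols_per_slot=2)
# Uplink subframe: 35 subchannels and 18 symbols are assumptions.
uplink = burst_data_subcarriers(35, 18, symbols_per_slot=3)

print(downlink, uplink)
```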
WirelessMAN-advanced air interface
The IEEE 802.16m standard introduces a completely new definition of the PHY known as the advanced air interface (AAI). The configurability of the parameters is reduced to a large extent, but additional features such as multiple-input multiple-output (MIMO) and hybrid automatic repeat request (H-ARQ) are now mandatory to meet the minimum requirements of the standard; backward compatibility is also mandatory. For a more detailed description, see [44].
A new profile with a channel bandwidth of 20 MHz and 2,048 subcarriers is added while the 3.5-MHz profile is discarded. To implement this new profile, the FFT size needs to support 2,048 subcarriers, and the DUC/DDC blocks have to support an additional up/downsampling factor of 25/7. The new frame structure is divided into superframes of 20 ms. Each superframe is made up of four 5-ms frames. The main difference with the old frame structure is the way the frames are subdivided into subframes to increase the flexibility of the allocation of downlink and uplink zones. Each subframe can be dynamically configured for downlink or uplink transmission. This dynamic behavior imposes the need to improve the Frame control block to be more flexible.
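The 25/7 resampling factor can be reproduced with exact rational arithmetic. The sketch below assumes a 28/25 sampling factor for the 20-MHz profile and a hypothetical 80-MHz converter rate; neither value is stated in this section, and both are used only because they yield the 25/7 factor mentioned above.

```python
from fractions import Fraction


def sampling_rate_hz(bandwidth_hz, sampling_factor):
    """Baseband sampling rate for a given nominal bandwidth."""
    return Fraction(bandwidth_hz) * sampling_factor


def resampling_factor(converter_rate_hz, baseband_rate_hz):
    """Up/downsampling ratio between the converter and baseband rates."""
    return Fraction(converter_rate_hz) / baseband_rate_hz


# Assumed sampling factor 28/25 and hypothetical 80-MHz converter rate.
baseband = sampling_rate_hz(20_000_000, Fraction(28, 25))
factor = resampling_factor(80_000_000, baseband)
spacing_hz = float(baseband / 2048)  # subcarrier spacing with 2,048 subcarriers
frames_per_superframe = 20 // 5      # 20-ms superframe, 5-ms frames

print(baseband, factor, spacing_hz, frames_per_superframe)
```

Keeping the rates as fractions avoids rounding and makes it clear that the DUC/DDC blocks need a rational rather than an integer rate change.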
The synchronization mechanisms have been improved by defining two new preambles: the PA preamble, with a fixed number of pilot subcarriers regardless of the FFT size used by the advanced base station (ABS), and the SA preamble, with a structure and purpose similar to those of the preamble of the previous release.
The new subchannelization scheme is designed to simplify the channel estimation and to reduce the signaling overhead required for the burst placement, and it only depends on the MIMO scheme in use.
The AAI defines new MIMO configurations to support single-user MIMO (SU-MIMO) and multiuser MIMO (MU-MIMO) schemes, both with adaptive and non-adaptive precoding. The WiMAX Forum defines the minimum number of ABS antennas as two, while the advanced mobile station (AMS) can operate with only one antenna. This leads to the need to replicate processing in transmit and receive chains only in the ABS.
For the initial ranging and handover mechanisms, new ranging preambles are added with extended length. Ranging preambles are transmitted with a subcarrier spacing which is a fraction of the regular frequency spacing. This behavior can be achieved with larger FFT sizes; hence, an adjustable FFT size in the corresponding processing blocks could be desirable.
Channel coding in 802.16m uses only two FEC schemes. On the one hand, a convolutional turbo code (CTC) encoder is defined to transmit the data bursts. On the other hand, a TBCC encoder with rate 1/5 is used to encode the control information. In this case, it would be necessary to implement two encoding and decoding algorithms inside the FEC processing block. The mandatory H-ARQ processing can be addressed inside the PHY Control task.
The proposed architecture can be readily adapted to support an implementation of the AAI. As an example, an adaptation of the architecture is shown in Figure 13, in which the ABS and the AMS are configured to support a 2 × 1 MIMO scheme, which would require an increase of hardware resources to support the implementation of the new functionalities. As noted before, the Frame control, Synchronization, and FFT/IFFT blocks must be enhanced to support the new subframe structure. The new subchannelization scheme can be implemented in the same DSP as the old PUSC blocks, as can the channel equalization step. The ABS MIMO requirements impose the need to replicate the transmit and receive chains, forcing an increase of hardware resources in the FPGA and in the DSP modules, since they have to implement the new precoding techniques. In this case, we have concluded that a new DSP needs to be added to the ABS to accommodate the increased baseband processing needs. Also, the H-ARQ technique requires more memory due to the need to store the received bursts. Finally, the new FEC schemes need to be implemented in a larger FPGA, as the Virtex-II does not have enough resources.
Conclusions
We have addressed the design and implementation of real-time OFDMA-TDD PHYs compliant with the WiMAX standard. We have presented a cost-effective SDR hardware architecture made up of FPGA and DSP modules that allows for the real-time implementation of all OFDMA-TDD PHY functionalities in the downlink and in the uplink at both the BS and the MS of the mobile WiMAX standard. We explained in detail the different design decisions adopted to accomplish this stringent objective. The proposed design is shown to efficiently use the available FPGA resources. Experimental evaluation of the downlink and the uplink obtained with the implemented BS and MS was carried out in real time using a hardware device that emulates AWGN and ITU-R wireless channel models. Specific performance metrics that take into account the detection of the frame and of the DL-MAP messages were considered to illustrate the adequate performance of the proposed design. Finally, the utilization of the proposed hardware architecture to implement the WirelessMAN-advanced air interface is discussed.
Endnotes
a Set to 36 in the IEEE Std. 802.16e.
