This paper presents the design and implementation of a high-performance baseband transceiver targeted for System on a Chip (SoC). The presented architecture utilizes a 4x4 Multiple-Input Multiple-Output Orthogonal Frequency Division Multiplexing (MIMO-OFDM) system and is capable of enabling wireless transmission at more than 1 Gbps. A complex channel equalization circuit is realized using matrix inversion via high-resolution QR decomposition. Full synthesis results are included, as this MIMO-OFDM transceiver has been proven on standard FPGA technology.
I. INTRODUCTION
Next-generation wireless networks are expected to provide high-speed internet access anywhere and at any time. The popularity and capability of current and next-generation smartphones undoubtedly accelerate this trend and create new traffic demand. Consequently, there is an increasing demand for high data rate transmission in future wireless networks. However, the data transmission rate is limited by channel capacity, which provides a theoretical upper limit on the data rate beyond which error-free transmission is impossible. This paper presents the architecture and implementation of a 4x4 MIMO-OFDM baseband transceiver. The design is targeted to an Altera Stratix IV GX FPGA.
II. BACKGROUND

A. MIMO-OFDM
OFDM is a method of encoding digital data on multiple carrier frequencies [1]. A large number of closely spaced orthogonal subcarrier signals are used to carry data. The data is divided into several parallel data streams or channels, one for each subcarrier. Each subcarrier is modulated with a conventional modulation scheme (such as quadrature amplitude modulation or phase-shift keying) at a low symbol rate, maintaining total data rates similar to conventional single-carrier modulation schemes in the same bandwidth.
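The subcarrier principle described above can be illustrated with a minimal sketch. It assumes a toy system of 8 subcarriers with QPSK on each one, and uses a direct (slow) DFT rather than the FFT cores of the actual design; the names `idft`, `dft` and `QPSK` are illustrative only.

```python
import cmath

def idft(freq):
    """Inverse DFT: one complex value per subcarrier -> time-domain samples."""
    n = len(freq)
    return [sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def dft(time):
    """Forward DFT: time-domain samples -> one complex value per subcarrier."""
    n = len(time)
    return [sum(time[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# Assumed toy parameters: 8 subcarriers, one QPSK symbol on each.
QPSK = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
data = [0, 1, 2, 3, 3, 2, 1, 0]
subcarriers = [QPSK[d] for d in data]

ofdm_symbol = idft(subcarriers)   # transmitter side: frequency -> time
recovered = dft(ofdm_symbol)      # receiver side: time -> frequency
```

Because the subcarriers are orthogonal, the forward transform at the receiver separates them again without mutual interference, so `recovered` matches `subcarriers` up to rounding error.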
MIMO is the use of multiple antennas at both the transmitter and receiver to improve communication performance [2]. Fig. 1 shows a basic 4x4 MIMO spatially multiplexed communication system. Each transmit antenna streams an independent and separately encoded data signal. The main challenge faced in MIMO-OFDM systems is how to accurately obtain the channel state information from the training sequences and subsequently apply its inverse to the received information channels in order to decode and thus demodulate the received data.
The preamble specified in the IEEE 802.11a standard is typically used for time and frequency synchronization as well as channel estimation. The first part of the preamble consists of 10 identical short training symbols (STS), each with a duration of 0.8 µs. These are used for time and frequency synchronization. The STS symbols are followed by a long guard interval (LGI) of 1.6 µs containing a copy of the second half of the long training symbol (LTS). Two identical LTSs (T1 and T2) then follow and are used for channel estimation. Channel estimation and equalization are the most computationally intensive operations of MIMO-OFDM decoding. The biggest challenge is to build a fast, accurate system that calculates the inverse of the complex channel matrix obtained for each OFDM subcarrier. To reduce complexity, channel equalization is implemented using high-resolution QR matrix decomposition [3, 4].
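As a quick check of the preamble timing described above, the durations can be added up as follows. The LTS duration of 3.2 µs is not stated in the text; it is assumed here because the 1.6 µs LGI is described as half of an LTS, which also agrees with IEEE 802.11a. All variable names are illustrative.

```python
# Durations in nanoseconds, so that the arithmetic stays exact.
STS_COUNT = 10
STS_NS = 800          # 0.8 us per short training symbol (from the text)
LGI_NS = 1600         # 1.6 us long guard interval (from the text)
LTS_NS = 2 * LGI_NS   # assumed: the LGI is a copy of half an LTS, so LTS = 3.2 us
LTS_COUNT = 2         # T1 and T2

short_ns = STS_COUNT * STS_NS            # short training section
long_ns = LGI_NS + LTS_COUNT * LTS_NS    # guard interval plus T1 and T2
preamble_ns = short_ns + long_ns

print(f"short section: {short_ns / 1000} us")
print(f"long section:  {long_ns / 1000} us")
print(f"full preamble: {preamble_ns / 1000} us")
```

Under these assumptions both halves of the single-antenna preamble last 8 µs, for 16 µs in total.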
B. Related Work
MIMO-OFDM algorithms and implementations have been thoroughly investigated in the literature. Most authors agree that the provision of an accurate and robust channel estimation strategy is crucial for achieving high performance [5, 6]. Perels et al. [7] present an FPGA implementation of a 4x4 MIMO-OFDM receiver. They use a Minimum Mean Square Error (MMSE) equalization technique realized from QR decomposition, with an achievable system clock rate of 50 MHz. Boher et al. [8] present an ASIC implementation of a MIMO-OFDM transceiver for 192 Mbps WLANs.
III. DESIGN AND IMPLEMENTATION OF 4x4 MIMO OFDM BASEBAND CIRCUIT
A. Transmitter Architecture
Fig. 2 shows a block diagram of the 4x4 MIMO transmitter architecture. Information data is split into four separate and independent channels that will each be encoded and modulated for transmission.
The transmitter must transmit preamble data before each burst of OFDM frames. The transmitter encodes and interleaves incoming data. The bits are grouped and mapped to I and Q complex data according to the modulation scheme. The modulated symbols are converted to the time domain via the inverse Fast Fourier Transform (IFFT) before transmission along with cyclic prefix as a series of OFDM frames.
The transmitter is preloaded with the frequency domain values for the STS and LTS, OFDM symbol pilots and a symbol mapper look-up table. Uncoded data is streamed into the 2/3 rate convolutional encoder to add adequate signal redundancy.
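The encoding step can be sketched as follows. The text does not say how the 2/3 rate is obtained; this sketch assumes the IEEE 802.11a approach, in which a rate-1/2, constraint-length-7 encoder (generator polynomials 133 and 171 octal) is punctured by dropping every second B output bit. The function names are illustrative and are not taken from the design.

```python
G0 = 0o133  # generator for output A (assumed, as in IEEE 802.11a)
G1 = 0o171  # generator for output B (assumed, as in IEEE 802.11a)

def parity(x):
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1

def conv_encode_half_rate(bits):
    """Rate-1/2 encoder: one (A, B) output pair per input bit."""
    state = 0  # 7-bit shift register, newest input bit in the MSB
    pairs = []
    for b in bits:
        state = ((b << 6) | (state >> 1)) & 0x7F
        pairs.append((parity(state & G0), parity(state & G1)))
    return pairs

def puncture_two_thirds(pairs):
    """Keep A1, B1, A2 from every two pairs and drop B2 (3 bits out per 2 in)."""
    out = []
    for i, (a, b) in enumerate(pairs):
        out.append(a)
        if i % 2 == 0:
            out.append(b)
    return out

impulse_pairs = conv_encode_half_rate([1, 0, 0, 0, 0, 0, 0])
coded = puncture_two_thirds(conv_encode_half_rate([1, 0, 0, 0]))
```

Feeding a single 1 followed by zeros reads out the two generator polynomials bit by bit, which is a convenient sanity check for the shift-register wiring.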
The output of the convolutional encoder is passed to a block interleaver circuit. The block interleaver consists of two memories, implemented using a large register structure.
The dual memory system allows continual streaming of data. Only when an entire memory block is full can it be read out to the symbol mapper. As one memory is accepting data from the convolutional encoder, the other memory streams out data using the interleaving pattern as specified by the 802.11a standard.
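The read-out pattern of the interleaver memories can be sketched with the two-step permutation defined in IEEE 802.11a. The block size used below (192 coded bits per OFDM symbol for 16-QAM, i.e. 48 data subcarriers) is an assumption taken from that standard, not a figure stated in the text, and the function names are illustrative.

```python
def interleave_permutation(n_cbps, n_bpsc):
    """Return perm, where input bit k is written to output position perm[k].

    n_cbps: coded bits per OFDM symbol, n_bpsc: coded bits per subcarrier.
    """
    s = max(n_bpsc // 2, 1)
    perm = []
    for k in range(n_cbps):
        # First permutation: adjacent coded bits land on non-adjacent subcarriers.
        i = (n_cbps // 16) * (k % 16) + k // 16
        # Second permutation: alternate between more and less significant bits.
        j = s * (i // s) + (i + n_cbps - (16 * i) // n_cbps) % s
        perm.append(j)
    return perm

def interleave(bits, perm):
    out = [0] * len(bits)
    for k, b in enumerate(bits):
        out[perm[k]] = b
    return out

def deinterleave(bits, perm):
    return [bits[perm[k]] for k in range(len(perm))]

PERM_16QAM = interleave_permutation(n_cbps=192, n_bpsc=4)  # assumed block size
block = [(k * 7) % 2 for k in range(192)]                   # arbitrary test pattern
restored = deinterleave(interleave(block, PERM_16QAM), PERM_16QAM)
```

In a ping-pong arrangement such as the one described, one memory would be filled in natural order while the other is read out in the order given by this permutation.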
The symbol mapper is a simple look-up table (LUT). The address of this memory is the output of the block interleaver. The address width (equal to the interleaver output width) is defined by the modulation scheme: 1 bit for BPSK, 2 bits for QPSK, 4 bits for 16-QAM and 6 bits for 64-QAM. Each address of the symbol mapper LUT contains the corresponding I and Q values that represent the constellation location. The dual-port nature of the memory enables two look-up tables to service all four channels.
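A look-up table of this kind can be sketched for the 16-QAM case as follows. The Gray-coded level assignment and the 1/sqrt(10) normalization follow IEEE 802.11a; the choice that the upper two address bits select I and the lower two select Q is an assumption made for this sketch.

```python
import math

# Gray-coded 2-bit group -> amplitude level (as in IEEE 802.11a 16-QAM).
GRAY_2BIT_TO_LEVEL = {0b00: -3, 0b01: -1, 0b11: 1, 0b10: 3}
SCALE = 1 / math.sqrt(10)  # normalizes the average symbol power to 1

def build_16qam_lut():
    """Return a 16-entry table: 4-bit address -> (I, Q) constellation point."""
    lut = []
    for addr in range(16):
        i_bits = (addr >> 2) & 0b11  # assumed: upper two bits select I
        q_bits = addr & 0b11         # assumed: lower two bits select Q
        lut.append((GRAY_2BIT_TO_LEVEL[i_bits] * SCALE,
                    GRAY_2BIT_TO_LEVEL[q_bits] * SCALE))
    return lut

LUT_16QAM = build_16qam_lut()
i_value, q_value = LUT_16QAM[0b1101]  # 4-bit interleaver output used as address
```

Changing the modulation scheme then only changes the table contents and the address width, not the surrounding datapath.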
The control path contains a master finite state machine (FSM) which controls the transmission of each burst of OFDM frames including the preamble sequence. It enables the STS and LTS frequency domain symbols to be read out of their respective memories, and encoded data symbols to be read from an incoming FIFO and fed into the IFFT.
The final block before transmission is the cyclic prefix block. The transmitter cyclic prefix block is shown in Fig. 3. This entity consists of a single dual-port memory element, twice the depth of the OFDM frame. This is necessary to enable continuous data streaming. The last 25% of the OFDM symbol is selected as the cyclic prefix and must be transmitted first. Therefore, while one complete frame is being transmitted through the read port of the memory, the other half of the memory is able to collect incoming data through the write port.
Fig. 3 shows the MIMO system preamble transmission pattern. STS data is transmitted from channel 0 only. The STS is required only for time synchronization, and to ensure a clean signal, only one transmitter is enabled. LTS data is transmitted from all four channels one after another. This is essential for channel estimation at the receiver.
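The cyclic prefix operation itself is simple and can be sketched in a few lines. This models only the ordering of the samples, not the dual-port memory; the function names are illustrative.

```python
def add_cyclic_prefix(symbol):
    """Prepend the last 25% of the OFDM symbol to the symbol itself."""
    cp_len = len(symbol) // 4
    return symbol[-cp_len:] + symbol

def remove_cyclic_prefix(frame, symbol_len):
    """Receiver side: drop the prefix and keep the last symbol_len samples."""
    return frame[-symbol_len:]

symbol = list(range(64))           # stand-in for 64 time-domain IFFT samples
frame = add_cyclic_prefix(symbol)  # 16 prefix samples followed by 64 samples
```

Since the prefix consists of the final samples of the symbol, it cannot be sent until the whole symbol has been written, which is why a full frame of buffering is needed before read-out begins.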
B. Receiver Architecture
The receiver must detect, demodulate and decode received OFDM symbols back to the original bit stream. Fig. 4 shows a diagram of the MIMO receiver architecture. The receiver contains a number of custom blocks, including a timing synchronizer, Fast Fourier Transform (FFT) blocks, buffers, channel estimation and equalization blocks, demappers, de-interleavers and decoders.
The first major entity on the receiver is the time synchronizer. The time synchronizer is designed to locate the start of a burst of OFDM frames when the system is in idle mode. Fig. 5 shows a diagram of the time synchronizer architecture. The time synchronizer must locate the end of the STS frame and the start of the LTS frame. The circuit is preloaded with the complex conjugate values of the last 16 STS samples and the first 16 LTS samples. The incoming data is correlated with the pre-stored data. At every clock cycle, a sliding window of 32 consecutive data samples is multiplied with the 32 pre-stored preamble values and the products are summed. This requires an array of 32 parallel complex multipliers along with a pipelined adder structure. The magnitude of the resulting complex value is then calculated, using a Coordinate Rotation Digital Computer (CORDIC) block as an efficient square-root calculator. The CORDIC output is compared with an expected threshold value to indicate that the start of a frame has been located.
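The sliding-window correlation can be sketched as follows. The 32-sample reference used here is a hypothetical chirp sequence standing in for the real STS/LTS boundary samples, the threshold is an arbitrary choice, and `abs()` replaces the CORDIC magnitude calculation.

```python
import cmath

def correlate_magnitude(samples, reference_conj):
    """Magnitude of the windowed correlation at every possible start position."""
    n = len(reference_conj)
    mags = []
    for start in range(len(samples) - n + 1):
        acc = sum(samples[start + i] * reference_conj[i] for i in range(n))
        mags.append(abs(acc))  # hardware: CORDIC-based magnitude
    return mags

def find_frame_start(samples, reference_conj, threshold):
    """Index of the first window whose correlation magnitude reaches threshold."""
    for start, mag in enumerate(correlate_magnitude(samples, reference_conj)):
        if mag >= threshold:
            return start
    return None

# Hypothetical reference: 32 unit-magnitude chirp samples (not the 802.11a preamble).
reference = [cmath.exp(1j * cmath.pi * i * i / 32) for i in range(32)]
reference_conj = [v.conjugate() for v in reference]  # pre-stored conjugates

samples = [0j] * 20 + reference + [0j] * 10          # reference begins at index 20
magnitudes = correlate_magnitude(samples, reference_conj)
frame_start = find_frame_start(samples, reference_conj, threshold=16.0)
```

When the window is aligned with the reference, every product equals 1 and the sum peaks at 32; at all other offsets the products partially cancel and the magnitude stays well below the threshold.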
Data is streamed into all four channels and stored temporarily in four individual circular buffers. Once the timing synchronizer indicates that the start of frame has been located, the received LTS data, minus the cyclic prefix, is streamed into all four FFTs. A 4x4 complex matrix is obtained for each subcarrier within the OFDM symbol. This is the channel matrix. For each burst of OFDM symbols, an array of 16 memories is populated with the channel matrices (one matrix for each OFDM subcarrier).
Once the channel matrix is obtained, the channel estimation process takes place. The channel estimator computes the inverse of the channel matrix for every subcarrier. QR decomposition is performed on the channel matrix before inversion. Fig. 6 shows the matrix inversion process. The channel matrix H is decomposed into two separate matrices: a Q matrix and an upper triangular matrix R. The inverse of the channel matrix is calculated by multiplying the inverse of the R matrix by the (conjugate) transpose of the Q matrix. The decomposition is carried out by a large systolic array of CORDIC elements that implements the three-angle complex rotation algorithm. Fig. 7 shows a block diagram of the QRD circuit used to compute the R matrix. For clarity, a reduced 2x2 matrix is shown. The boundary cells perform the vectoring calculation by rotating the complex value onto the x-axis. The angles rotated through in doing so are passed horizontally through the systolic array. The internal cells rotate the complex values that travel from top to bottom by the angles that are passed from left to right. The systolic array for calculating the R matrix is connected directly to the array for calculating the Q matrix. The structure of this array is shown in Fig. 8. Again, for clarity, a reduced 2x2 matrix is shown. A scheduler controls the read address of the channel matrix memories and multiplexes the outputs of these memories into the systolic array.
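The decomposition can be illustrated in floating point with complex Givens rotations. This is a behavioural stand-in only: it applies each rotation with ordinary complex arithmetic instead of the fixed-point CORDIC cells and three-angle scheme of the systolic array, and the 4x4 matrix `H` is made-up example data.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def qr_givens(h):
    """Return (Q, R) with H = Q R, Q unitary and R upper triangular."""
    n = len(h)
    r = [[complex(x) for x in row] for row in h]
    qh = [[1 + 0j if i == j else 0j for j in range(n)] for i in range(n)]  # builds Q^H
    for col in range(n):
        for row in range(col + 1, n):
            a, b = r[col][col], r[row][col]
            norm = math.hypot(abs(a), abs(b))
            if norm == 0.0:
                continue  # nothing to annihilate
            c, s = a / norm, b / norm
            # Rotate the two rows so that r[row][col] becomes zero.
            for m in (r, qh):
                top, bottom = m[col], m[row]
                m[col] = [c.conjugate() * t + s.conjugate() * u
                          for t, u in zip(top, bottom)]
                m[row] = [-s * t + c * u for t, u in zip(top, bottom)]
    q = [[qh[j][i].conjugate() for j in range(n)] for i in range(n)]
    return q, r

# Made-up 4x4 channel matrix for one subcarrier.
H = [[1 + 1j, 2, 0.5j, 1],
     [0, 1 - 1j, 1, 2j],
     [1, 0.5, 2 + 1j, 0],
     [0.5j, 1, 0, 1 - 2j]]
Q, R = qr_givens(H)
```

Each rotation zeroes one element below the diagonal, which is the role played by a boundary cell (computing the rotation) together with the internal cells in the same row (applying it).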
The R matrices are captured and passed to the R inverse calculation block. The R inverse block calculates the following equations:
This circuit is heavily pipelined. Many shift registers are required because some terms take longer to compute than others, and because the calculation of some matrix terms requires the result of other matrix terms (e.g. the calculation of R^-1(2,3) requires R^-1(3,3)).
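The dependency between terms can be seen in a minimal sketch of upper triangular inversion by back substitution. This is the standard textbook method, assumed here to be equivalent to what the R inverse block computes; the example matrix is made up.

```python
def invert_upper_triangular(r):
    """Invert an upper triangular matrix by back substitution."""
    n = len(r)
    inv = [[0j] * n for _ in range(n)]
    # Work from the bottom row upwards: row i needs the rows below it.
    for i in range(n - 1, -1, -1):
        inv[i][i] = 1 / r[i][i]
        for j in range(i + 1, n):
            # Uses inv[k][j] for k > i, e.g. inv(2,3) depends on inv(3,3).
            acc = sum(r[i][k] * inv[k][j] for k in range(i + 1, j + 1))
            inv[i][j] = -acc / r[i][i]
    return inv

# Made-up upper triangular R matrix.
R_EXAMPLE = [[2, 1 + 1j, 0, 1],
             [0, 1, 2j, 0.5],
             [0, 0, 4, 1 - 1j],
             [0, 0, 0, 2]]
R_INV = invert_upper_triangular(R_EXAMPLE)
```

The diagonal terms need only a reciprocal, whereas the off-diagonal terms need multiply-accumulate operations on previously computed terms, which is consistent with the different latencies that the shift registers have to balance.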
The R inverse matrices and Q transpose matrices are stored in an array of memories. Once all subcarrier matrices have been calculated, they are streamed out one subcarrier at a time and fed into a 4x4 matrix multiplication block. Each resulting matrix is the inverse of the channel matrix, which is itself stored in a 4x4 array of dual-port memory blocks.
MIMO decoding takes place after the channel estimation process completes. OFDM data is read out of the four channel FIFOs. The corresponding channel estimation matrix is read out of the corresponding subcarrier location of the 16 channel estimate memories. The OFDM data and the channel estimation data are multiplied together in the form of a matrix multiplication. This multiplication results in the equalized OFDM data.
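The per-subcarrier equalization can be sketched as follows. For brevity this uses a reduced 2x2 system with a closed-form inverse instead of the 4x4 QR-based inverse, and a noise-free channel; the channel values and symbols are made up.

```python
def invert_2x2(h):
    """Closed-form inverse of a 2x2 complex matrix."""
    (a, b), (c, d) = h
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

# Made-up channel matrices and transmitted symbols for two subcarriers.
channel = [[[1 + 0.5j, 0.2j], [0.3, 0.8 - 0.1j]],
           [[0.9, -0.4 + 0.1j], [0.1j, 1.1]]]
transmitted = [[1 + 1j, -1 + 1j], [1 - 1j, -1 - 1j]]

# What the receive antennas see: each antenna gets a mix of both streams.
received = [mat_vec(h, x) for h, x in zip(channel, transmitted)]

# Channel estimation result: one inverse matrix stored per subcarrier.
estimates = [invert_2x2(h) for h in channel]

# MIMO decoding: multiply each received vector by the stored inverse.
equalized = [mat_vec(w, y) for w, y in zip(estimates, received)]
```

Because the inverse matrices are computed once per burst and then reused, the per-symbol work in the datapath reduces to one matrix-vector multiplication per subcarrier.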
The pilot tones are extracted and descrambled. The average value of the pilot tones is calculated and phase correction is performed on the entire OFDM symbol by multiplying each subcarrier by the pilot tone average. The next step on the receiver data-path is to perform feed forward timing synchronization. Again the (now phase-corrected) pilot tones must be extracted. Each pilot tone is divided by its subcarrier number and then the average is calculated to determine the feed-forward time synchronization value, Tau.
Each subcarrier must be time corrected by adding the relevant Tau value to the real component and subtracting it from the imaginary component. The relevant Tau value for each subcarrier is simply the subcarrier number multiplied by the Tau value. In order to simplify this process, a running adder is used. Each clock cycle, as the time correction is performed on each incrementing subcarrier, the Tau value is also incremented using a feedback adder.
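The running adder can be sketched as follows. It assumes that subcarrier numbering starts at 0 and that Tau is held as a fixed-point integer; it shows only how the per-subcarrier correction value is generated, not how it is applied to the real and imaginary components.

```python
def running_tau(tau, num_subcarriers):
    """Produce k * tau for k = 0 .. num_subcarriers - 1 using only additions."""
    acc = 0
    values = []
    for _ in range(num_subcarriers):
        values.append(acc)  # correction value for the current subcarrier
        acc += tau          # feedback adder: one addition per clock cycle
    return values

TAU_FIXED_POINT = 3         # made-up fixed-point Tau value
corrections = running_tau(TAU_FIXED_POINT, 64)
```

This replaces one multiplier per subcarrier with a single adder and register, at the cost of having to process the subcarriers in increasing order.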
The symbol demapper is implemented using a decoder-multiplexer structure. The symbol demapper is set up to perform soft symbol demapping.
The output of the symbol demapper is fed into a block de-interleaver. The block de-interleaver has the same structure as the interleaver on the transmitter, except that the read and write address patterns are reversed. The de-interleaver is larger, as it must store the soft-bit representation of the data in every bit location.
Error correction is performed using a Viterbi decoder.
IV. IMPLEMENTATION
A. Transmitter
Table 1 contains synthesis figures for the MIMO transceiver when configured with 16-QAM and 64-point OFDM. Table 2 shows the resource utilization for each of the main processing blocks within the MIMO transmitter. For a 512-point OFDM system the IFFT and interleaver will require eight times as many resources.
B. Receiver
Table 3 contains Altera Stratix IV GX synthesis figures for the MIMO receiver when configured with 16-QAM and 64-point OFDM. Table 4 shows the resource utilization of the MIMO receiver by entity. The channel estimation and equalization blocks (R matrix inverse, MIMO decoder, QR decomposition and QR multiplier) account for 86% of the ALUTs and 77% of the DSP multipliers within the circuit.
The size and complexity of the channel estimation and equalization blocks remain constant with respect to OFDM frame size. However, the processing latency increases with OFDM frame size, and for a 512-point OFDM system the number of memory bits required increases by a factor of approximately eight. There are nevertheless more than sufficient memory resources available on the FPGA to accommodate a 512-point OFDM system.
