I. INTRODUCTION
Multiple-input multiple-output (MIMO) techniques combined with orthogonal frequency division multiplexing (OFDM), i.e., MIMO-OFDM, have been identified as a promising approach for wideband systems with high spectral efficiency [1], [2]. The optimal detection method for coded systems is maximum a posteriori (MAP) detection. However, the computational complexity of optimal MAP detection is beyond the limits of most systems, so the approach is not feasible in practice. A lower-complexity alternative is to use suboptimal linear detectors based on the zero forcing (ZF) or minimum mean square error (MMSE) criterion [3].
Several approaches exist to solve the matrix inversion required by the linear MMSE (LMMSE) detector [4], [5]. In this paper, two square root free algorithms are considered for the implementation of an LMMSE detector. The algorithms, namely the coordinate rotation digital computer (CORDIC) [6] algorithm and the squared Givens rotation (SGR) [7] algorithm, are applied to compute the QR decomposition (QRD) via Givens rotations. The matrix inversion is then obtained by using a triangular matrix inversion algorithm [8] or a back substitution algorithm [4]. Two detector architectures, based on these algorithms and systolic array structures [9], [10], are designed for a 2 x 2 MIMO-OFDM system and implemented in a field programmable gate array (FPGA) chip. The pipelined architectures are fast, parallel, and suitable for OFDM systems, where the detector coefficients have to be calculated for multiple subcarriers within the channel coherence time.
The FPGA implementations of the detectors are mapped to the Elektrobit OFDM testbed for 4G MIMO systems (EB4G). The performance of the algorithms is evaluated using the EB4G hardware testbed and a Propsim C8 MIMO channel emulator creating the four baseband channels in the 2 x 2 MIMO system in real-time. The performance results in different real time channels are presented and evaluated.
The paper is organized as follows. The system model is presented in Section II. The designed architectures are presented in Section III. The hardware implementation in FPGA is presented in Section IV. The performance results are presented in Section V. Summary and Conclusions are presented in Section VI.
II. SYSTEM MODEL
A MIMO-OFDM system with two transmit antennas and two receive antennas is considered, as shown in Figure 1. The received signal can be expressed in terms of the code symbol interval as
$\mathbf{r}_p = \mathbf{H}_p \mathbf{x}_p + \boldsymbol{\eta}_p, \quad p = 1, 2, \ldots, P,$ (1)
where $P$ is the number of subcarriers, and the received signal vector $\mathbf{r}_p$, the transmit symbol vector $\mathbf{x}_p$ and the noise vector $\boldsymbol{\eta}_p$ are defined in the frequency domain. For the LMMSE solution (2), we assume $\mathbf{R}_{xx} = E_s \mathbf{I}_2$ and $\mathbf{R}_{\eta\eta} = N_0 \mathbf{I}_2$.
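As an illustration of the per-subcarrier coefficient calculation, the following Python sketch assumes that the LMMSE solution takes the standard form $\mathbf{W}_p = (\mathbf{H}_p^H \mathbf{H}_p + (N_0/E_s)\mathbf{I}_2)^{-1}\mathbf{H}_p^H$, which follows from $\mathbf{R}_{xx} = E_s \mathbf{I}_2$ and $\mathbf{R}_{\eta\eta} = N_0 \mathbf{I}_2$; the normalization may differ from the one used in (2). The function names are hypothetical, and the closed-form 2 x 2 inverse is used only for brevity; it is not the QRD based method considered in this paper.

```python
def lmmse_coefficients_2x2(H, noise_to_signal):
    """Return W = (H^H H + (N0/Es) I)^-1 H^H for a 2x2 complex matrix H.

    H is a nested list [[h11, h12], [h21, h22]]; noise_to_signal is N0/Es.
    Assumed standard LMMSE form, not necessarily the paper's exact (2).
    """
    (h11, h12), (h21, h22) = H
    # Hermitian transpose of H
    Hh = [[h11.conjugate(), h21.conjugate()],
          [h12.conjugate(), h22.conjugate()]]
    # A = H^H H + (N0/Es) I is Hermitian, so a21 = conj(a12)
    a11 = abs(h11) ** 2 + abs(h21) ** 2 + noise_to_signal
    a22 = abs(h12) ** 2 + abs(h22) ** 2 + noise_to_signal
    a12 = h11.conjugate() * h12 + h21.conjugate() * h22
    a21 = a12.conjugate()
    det = a11 * a22 - a12 * a21
    inv = [[a22 / det, -a12 / det],
           [-a21 / det, a11 / det]]
    # W = A^-1 H^H
    return [[sum(inv[i][k] * Hh[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def detect(W, r):
    """Apply the coefficient matrix to a received vector: x_hat = W r."""
    return [W[i][0] * r[0] + W[i][1] * r[1] for i in range(2)]
```

With `noise_to_signal = 0.0` the sketch reduces to the ZF solution, which is a convenient sanity check for an invertible channel matrix.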
The calculation of the LMMSE solution in (2) requires a matrix inversion, which is computationally very complex. In this paper, two square root free methods based on QRD via Givens rotations are considered for calculating the matrix inversion in (2):
* CORDIC [6] with the back substitution algorithm [4]
* SGR [7] with the triangular matrix inversion algorithm [8]
For more details, see [12].
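To illustrate why CORDIC avoids square roots, the following Python sketch runs the CORDIC vectoring mode: a vector (x, y) is driven onto the x-axis by micro-rotations whose tangents are powers of two, so a hardware realization needs only shifts, additions and a small angle table. The function name, the floating-point arithmetic, the iteration count and the explicit gain compensation are assumptions made for readability; they do not describe the fixed-point processing elements implemented here.

```python
import math


def cordic_vectoring(x, y, iterations=16):
    """Rotate (x, y) onto the positive x-axis with CORDIC micro-rotations.

    Returns (magnitude, angle) of the input vector. Assumes x > 0.
    Multiplying by 2**-i models a right shift by i bits in hardware.
    """
    angle = 0.0
    for i in range(iterations):
        step = 2.0 ** -i
        if y >= 0:
            # Clockwise micro-rotation by atan(2**-i), unnormalized
            x, y = x + y * step, y - x * step
            angle += math.atan(step)
        else:
            # Counter-clockwise micro-rotation by atan(2**-i), unnormalized
            x, y = x - y * step, y + x * step
            angle -= math.atan(step)
    # Each micro-rotation stretches the vector by sqrt(1 + 2**-2i).
    # The product is a constant for a fixed iteration count, so hardware
    # would store it rather than compute it.
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(iterations))
    return x / gain, angle
```

In a Givens rotation, the angle found in vectoring mode for the leading element pair is then applied to the remaining elements of the two rows in rotation mode.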
III. ARCHITECTURE
The architectural design of the matrix operations in the LMMSE detectors is based on systolic array structures with communicating processing elements (PEs) [9], [10]. A simple and highly parallel triangular array architecture is applied for computing the QRD [9]. The architecture enables a simple data flow and achieves high throughput with pipelining. This is important in a MIMO-OFDM system, where the detector coefficients are calculated separately for each subcarrier within the channel coherence time. Both the algorithm for inversion of a triangular matrix [8] and the back substitution algorithm [4] are implemented using a triangular array architecture.
The high level architecture of the LMMSE detector is presented in Figure 2. Computationally, the most complex part of the detector is the coefficient calculation block, i.e., the calculation of (2). The CORDIC and SGR based architectures for the 2 x 2 LMMSE detector coefficient matrix calculation are illustrated in Figure 3 and Figure 4, respectively. The matrix to be inverted in (2) is formed in part A1. The matrix inversion is then calculated in part A2, which consists of two systolic arrays. The upper systolic array calculates the QRD of the matrix to be inverted using the CORDIC or SGR algorithm. The lower systolic array applies the back substitution or the triangular matrix inversion algorithm. The matrix-matrix multiplication of the inverted matrix and the channel matrix is calculated in part A3 in the SGR based architecture, and as part of the back substitution in part A2 in the CORDIC based architecture. For more details of the architectures, see [12]. The applied architectures do not require much control logic, and the mapping of the data flow is relatively straightforward.
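The two-stage structure of part A2 (QRD followed by back substitution) can be modelled functionally as in the Python sketch below. It is a floating-point reference model under stated assumptions, not a description of the systolic arrays: the Givens rotations use an explicit square root, which the CORDIC and SGR algorithms avoid, and the function names are hypothetical.

```python
import math


def givens_qr(A):
    """Return (Qh, R) with Qh * A = R, R upper triangular, Qh unitary."""
    n = len(A)
    R = [row[:] for row in A]
    Qh = [[complex(i == j) for j in range(n)] for i in range(n)]
    for col in range(n):
        for row in range(col + 1, n):
            a, b = R[col][col], R[row][col]
            r = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
            if r == 0.0:
                continue  # nothing to annihilate
            c, s = a / r, b / r
            # Apply the same rotation to R and to the accumulated Qh
            for M in (R, Qh):
                top, bot = M[col], M[row]
                M[col] = [c.conjugate() * t + s.conjugate() * u
                          for t, u in zip(top, bot)]
                M[row] = [-s * t + c * u for t, u in zip(top, bot)]
    return Qh, R


def back_substitute(R, B):
    """Solve R X = B for X, where R is upper triangular."""
    n, m = len(R), len(B[0])
    X = [[0j] * m for _ in range(n)]
    for j in range(m):
        for i in reversed(range(n)):
            acc = B[i][j] - sum(R[i][k] * X[k][j] for k in range(i + 1, n))
            X[i][j] = acc / R[i][i]
    return X


def invert_via_qrd(A):
    """Invert A using A^-1 = R^-1 Qh, computed by back substitution."""
    Qh, R = givens_qr(A)
    return back_substitute(R, Qh)
```

Passing $\mathbf{Q}^H \mathbf{H}_p^H$ instead of $\mathbf{Q}^H$ as the right-hand side would fold the final matrix-matrix multiplication into the back substitution, in the spirit of the CORDIC based architecture.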
IV. IMPLEMENTATION
The FPGA implementations of the detectors are mapped to the EB4G MIMO-OFDM testbed, which supports high-speed configurations up to 4 x 4 MIMO and has flexible interfaces for digital and analog baseband, intermediate frequency (IF) and radio frequency (RF) connections. The FPGA implementations of the detectors are synthesized for a Xilinx Virtex-II XC2V6000 chip and are designed to operate at a 66 MHz clock frequency, which is the internal frequency used in the EB4G. The EB4G main technical parameters are listed in Table I.
The high level architecture of the LMMSE detector is presented in Figure 2. The LMMSE coefficient calculation block receives the scaled channel estimates $\mathbf{H}_p$ and EB as input and gives the detector coefficient matrices $\mathbf{W}_p$ as output for each subcarrier $p$. Adaptive scaling is applied to the LMMSE input values to set them at an optimal level with respect to the internal accuracy of the coefficient calculation block. The scaling is done separately for each subcarrier according to the largest channel estimate coefficient, and it is compensated for after the detection. The coefficient calculation operates at the frame interval, i.e., the coefficients are recalculated every 80 [...]
The SGR based implementation uses mainly 18 bit fixed-point internal word lengths in the coefficient matrix calculation, which includes the matrix-matrix multiplications, the SGR based QRD and the triangular matrix inversion. Adaptive scaling is done in the coefficient calculation before the matrix inversion due to the high dynamic range requirements of the SGR based QRD. The matrix to be inverted in part A2 is scaled to a desired level according to the highest value of each matrix. The scaling is then compensated for after the matrix inversion, as illustrated in Figure 4. It was noted that the reciprocal divider needed in the QRD and in the matrix inversion is the most accuracy demanding point in the matrix inversion calculation. The implemented divider uses a 20 bit fixed-point word length and three step adaptive scaling to reduce the required dynamic range of the signal. The latency for calculating all 52 2 x 2 coefficient matrices with the SGR based implementation is 574 clock cycles, i.e., 8.69 μs. The device utilization of the SGR based LMMSE detector in the EB4G is listed in Table III.
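The per-subcarrier adaptive scaling described above can be sketched as follows. The sketch assumes a power-of-two scale factor (a bit shift), round-to-nearest quantization with saturation, and an 18 bit signed word with 17 fractional bits; the actual scaling rule, rounding mode and word format of the testbed are not specified here, and the function name is hypothetical.

```python
import math


def scale_and_quantize(values, word_length=18):
    """Scale complex values by a power of two, then quantize them.

    The shift is chosen so that the largest real or imaginary component
    falls in [0.5, 1). Returns (quantized_values, shift); the original
    level is recovered by multiplying with 2**-shift afterwards.
    """
    peak = max(max(abs(v.real), abs(v.imag)) for v in values)
    if peak == 0.0:
        return [0j for _ in values], 0
    shift = -math.floor(math.log2(peak)) - 1
    frac_bits = word_length - 1  # one bit is reserved for the sign
    lsb = 2.0 ** -frac_bits

    def quantize(t):
        n = round(t * 2.0 ** shift / lsb)
        # Saturate to the signed fixed-point range
        n = max(-(2 ** frac_bits), min(2 ** frac_bits - 1, n))
        return n * lsb

    return [complex(quantize(v.real), quantize(v.imag)) for v in values], shift
```

Because the shift is returned together with the quantized values, the scaling can be undone after the detection, which mirrors the compensation step described in the text.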
V. PERFORMANCE EXAMPLES
The performance measurements are done with an EB4G hardware testbed, and a Propsim C8 MIMO channel emulator is used to create the four baseband channels of the 2 x 2 MIMO system in real time. A photo of the measurement configuration is shown in Figure 5 and the EB4G main technical parameters are listed in Table I. The performance of both the CORDIC and the SGR based detector implementations is measured at baseband and compared to simulation results. The simulations have been done in Matlab with a floating point representation. The effect of the LS estimator used in the real-time measurements is included in the Matlab model by adding noise to the channel coefficients [13].
A convolutionally coded spatial multiplexing (SM) transmission with quadrature phase shift keying (QPSK) and bit-interleaving is applied, with an LMMSE detector and a Viterbi decoder at the receiver. A rate 1/2 convolutional code with generator polynomials [171, 133] is applied, and the coding is done over one OFDM symbol interval. A least squares (LS) based channel estimator is used in the EB4G with two OFDM pilot symbols per frame [13]. A [...] Figure 6. It can be noted that the total performance loss of the implemented system compared to the simulated results is between 2 and 2.5 dB at a BER level of $10^{-2}$ with BS antenna separations of 4λ and 10λ. The antenna separation of 0.5λ, i.e., the high correlation case, results in a greater implementation loss. A channel realization with higher correlation results in a higher eigenvalue spread, which leads to a higher dynamic range in the signal representation in the calculation of $\mathbf{W}_p$. The WINNER A1 channel scenario is rather flat fading, so a bad channel realization affects multiple subcarriers. Thus, the channel coding and the interleaving are often not able to correct the errors. It can be noted that the BER performance of the SGR based detector starts to saturate earlier than the performance of the CORDIC based detector. This is due to the high dynamic range requirements of the SGR algorithm.
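The rate 1/2 convolutional code used in the measurements can be sketched in Python as follows. The sketch assumes that [171, 133] are octal generator polynomials of a constraint length 7 code, that the newest input bit enters at the most significant register position, and that the output of polynomial 171 is emitted first; the bit ordering used in the EB4G, as well as any termination or interleaving, is not modelled. The function name is hypothetical.

```python
GENERATORS = (0o171, 0o133)  # assumed octal generator polynomials
CONSTRAINT_LENGTH = 7


def conv_encode(bits):
    """Encode a list of 0/1 bits with the rate 1/2 convolutional code.

    Two output bits are produced per input bit, one per generator.
    The shift register starts in the all-zero state.
    """
    register = 0
    coded = []
    for bit in bits:
        # Shift the register and insert the new bit at the MSB position
        register = (register >> 1) | (bit << (CONSTRAINT_LENGTH - 1))
        for generator in GENERATORS:
            # Output bit = parity of the register taps selected by the generator
            coded.append(bin(register & generator).count("1") % 2)
    return coded
```

Feeding a single 1 followed by six 0s reads out the binary digits of the two generator polynomials, which gives a quick check of the tap ordering.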
BER results with the coded system in the WINNER B1 channel are shown in Figure 7. In this case the performance loss between measurement and simulation results is between 1 [...]
BER results with the coded system in the WINNER C1 channel are shown in Figure 9. The performance loss due to higher antenna [...]
Fig. 10. Convolutional coded SM system with LMMSE detector and Viterbi decoder and Alamouti system in WINNER B1 channel with 120 kmph velocity.
VI. SUMMARY AND CONCLUSIONS
The LMMSE detector implementations were based on the CORDIC and SGR algorithms, and were designed using systolic array architectures and fixed-point arithmetic.
The measurement results were compared to Matlab floating point simulations. It was noted that the performance of the detectors was highly dependent on the channel scenario and the correlation properties of the channel. Typically, the performance loss between simulated and measured results was between 1 and 2 dB. The results showed that the SGR based detector implementation has problems with fixed-point arithmetic related to the large dynamic range required in the signal representation. The problems would decrease with floating point arithmetic. The traditional CORDIC design seemed more suitable for the considered system with fixed-point arithmetic. However, it was shown that both implementations work in a real channel environment.
