Abstract-This paper presents a space-time block coding decoder for MIMO-OFDM enabled mobile terminals. The decoder is implemented on a programmable baseband processor aimed at software-defined radio (SDR). The dynamic range supplied by the floating-point SIMD datapath allows special algorithms to significantly reduce the computational latency of decoding. The programmable solution not only supports different transmit/receive antenna configurations, but also allows hardware multiplexing to obtain silicon and power efficiency. Compared to several existing fixed-function ASIC solutions in the literature, the one proposed in this paper is by far the smallest and fastest, and it is more flexible.
I. INTRODUCTION
Recently, multi-antenna or Multiple-Input and Multiple-Output (MIMO) technologies have been adopted by the latest wireless standards such as WiMAX and 3GPP LTE, where they can significantly enhance system performance by utilizing various degrees of freedom. Meanwhile, multi-carrier technologies such as orthogonal frequency division multiplexing (OFDM) have been widely used to cope with frequency-selective fading and to increase spectrum efficiency.
Complex-valued matrix inversion and multiplication are involved in the baseband processing of MIMO systems. This usually requires much higher computing power than single-antenna systems do. Furthermore, the number of channels to be processed increases significantly due to the use of OFDM. Since the amount of computing power is always limited in mobile terminals, MIMO-OFDM poses a challenge to real-time baseband signal processing.
Due to engineering difficulties (e.g. the space needed and the integration cost), it is hard to increase the number of antennas on the terminal side, while this is usually practical on the basestation side. As illustrated in Fig. 1, multi-user space-time block coding (MU-STBC) increases both the diversity gain and the data rate in the downlink by using more transmitting antennas on the basestation side. Among numerous MIMO schemes, Alamouti STBC [1] is the one most widely adopted in industrial standards (both WiMAX and LTE). Therefore, investigating the implementation of Alamouti MU-STBC is of great importance for practical systems. Without loss of generality, this paper presents a real-time MU-STBC linear decoder using a programmable baseband processor.
The remainder of the paper is organized as follows. Sec. II presents the system model. Sec. III introduces linear MIMO decoders. The newly proposed simplification of matrix inversion is elaborated in Sec. IV. Sec. V describes the architecture of the baseband processor and Sec. VI evaluates its numerical precision. Sec. VII presents both the FPGA and ASIC based implementations of the processor. Finally, Sec. VIII concludes the paper.
II. SYSTEM MODEL
Diversity in MIMO leads to improved reliability of information transmission for a given data rate, and it can be achieved in multiple dimensions such as space, time and frequency. Among the diversity enhancing schemes, space-time block coding (STBC) is one of the most widely used transmission methods. The basic principle of STBC is to transmit multiple copies of the information symbols over multiple independent channels in time and space. Therefore, STBC can be used to improve the reliability of transmission by making the channel less prone to fading and increasing the robustness to co-channel interference. Alamouti STBC is a basic STBC scheme proposed in [1]. The basic 2 × 2 Alamouti matrix is defined as
Multi-user Alamouti STBC is a scheme that uses multiple antenna arrays (two transmitting antenna elements are considered as one user) to transmit information symbols using space-time coding. It can be used for both downlink and uplink. In this paper, as depicted in Fig. 1, only the downlink scenario is considered, where the basestation uses four antennas (equal to two data channels) to transmit data to a single mobile station (note that in this case the channel matrix is 4 × 4 instead of 4 × 2 due to the Alamouti scheme). In MIMO systems, the general transmission model is

$$r = Hs + n \tag{2}$$

which can be applied to each subcarrier in OFDM systems. On the receiver side, the channel matrix H is estimated by the channel estimator, and the received vectors r are transformed using a linear or nonlinear matrix equalizer to compute $\hat{s}$, which is the estimate of the transmitted symbol vector s. The vector n is the additive noise at the receiver side.
III. LINEAR MIMO DECODERS
Linear MIMO decoders such as zero-forcing (ZF) and minimum mean square error (MMSE) decoders are among the most widely used decoders in MIMO systems owing to their low complexity. From Eq. 2, ZF decoding can be described as

$$\hat{s} = (H^H H)^{-1} H^H r \tag{3}$$

where $\hat{s}$ is the estimate of the transmitted symbol vector s. MMSE decoding can be described similarly as

$$\hat{s} = (H^H H + \sigma^2 I)^{-1} H^H r \tag{4}$$

Both ZF and MMSE decoding involve the calculation of an equalization matrix W. For ZF,

$$W = (H^H H)^{-1} H^H \tag{5}$$

and for MMSE,

$$W = (H^H H + \sigma^2 I)^{-1} H^H \tag{6}$$
Here, $\sigma^2$ is the noise variance. The calculation of W needs to be done as soon as possible since symbol detection cannot start until W is ready. While W is being computed, received symbols have to be stored in a memory buffer. Therefore, the computation time of W determines the memory cost and the feedback latency from the receiver to the transmitter. On the other hand, the computation of W is a channel-rate rather than a symbol-rate computation, which means that W only needs to be calculated once per channel coherence time. As shown in Eq. 6, MMSE takes the noise power into account, which allows it to outperform ZF.
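The split between channel-rate and symbol-rate work can be illustrated with a minimal sketch. It is not the firmware of this paper: it uses a toy 2 × 2 channel, plain Python lists, and hypothetical helper names (`herm`, `matmul`, `inv2`, `equalizer`), and the channel values and noise variance are arbitrary assumptions.

```python
# Minimal sketch (not the paper's firmware): ZF and MMSE equalization
# matrices for a toy 2x2 channel. All names and values are illustrative.

def herm(m):
    """Conjugate transpose of a list-of-rows matrix."""
    return [[m[r][c].conjugate() for r in range(len(m))]
            for c in range(len(m[0]))]

def matmul(a, b):
    """Matrix product of two list-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inv2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def equalizer(h, sigma2=0.0):
    """W = (H^H H + sigma2 * I)^-1 H^H; sigma2 = 0 gives ZF, sigma2 > 0 MMSE."""
    hh = herm(h)
    g = matmul(hh, h)
    for i in range(len(g)):
        g[i][i] += sigma2
    return matmul(inv2(g), hh)

h = [[1 + 1j, 0.5 - 0.2j], [0.3j, 0.8 + 0.1j]]   # assumed channel estimate
s = [[1 + 0j], [-1 + 0j]]                         # transmitted symbols
r = matmul(h, s)                                  # noiseless r = H s

w_zf = equalizer(h)          # channel-rate work: once per coherence time
w_mmse = equalizer(h, 0.1)   # sigma2 = 0.1 is an arbitrary example value
s_zf = matmul(w_zf, r)       # symbol-rate work: one product per received r
s_mmse = matmul(w_mmse, r)
```

In this noiseless toy case the ZF estimate reproduces s up to rounding, while the MMSE estimate is slightly biased towards zero by the regularization term.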
The calculation of W involves several matrix manipulations such as matrix inversion and multiplication. How often these manipulations are needed depends on the variation of the wireless channel. Taking a two-user 2 × 2 STBC based WiMAX system as an example, we assume that a 5 MHz channel bandwidth is used and that the working frequency f is 2.4 GHz. If the mobile handset is moving at speed v = 60 km/h, the maximum Doppler shift f_m is 139 Hz. The following formula (from [2])
estimates the channel coherence time T_c to be 8 ms, and the symbol duration is 102.9 µs. In this case, excluding the pilot and null subcarriers, the number of channel matrices to be inverted for each user is around 30 (in OFDMA, a single user only uses a subset of all subcarriers; in this paper we use 32). As mentioned in the previous paragraph, the computation of W needs to be finished as soon as possible (e.g. within 0.1-0.2 symbol durations, to keep the on-chip memory small). The goal of this paper is to find a practical solution that meets this real-time constraint.
IV. COMPUTATION OF EQUALIZATION MATRIX W
Eq. 6 shows that computing W involves one matrix inversion and two matrix multiplications. According to [7] and [8], matrix inversion is the most computationally intensive operation in MIMO linear decoding.
A. Matrix Inversion for MIMO Decoding
As explained in [4], for numerical stability, large matrices are usually inverted by applying QR decomposition. However, this is not necessarily the most efficient way to invert small matrices (e.g. 4 × 4). Since the size of H considered is typically between 2 × 2 and 4 × 4, there are alternatives that are faster and more silicon efficient while still providing sufficient numerical stability. In [4], a method called blockwise analytic matrix inversion (BAMI) is proposed, which inverts a complex-valued matrix by partitioning it into four smaller matrices and then computing the inverse from operations on these smaller parts. For example, to compute the inverse of a 4 × 4 matrix M, it is first divided into four 2 × 2 submatrices

$$M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$$

The inverse of M can then be computed as

$$M^{-1} = \begin{bmatrix} A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1} \end{bmatrix}$$

The same method has also been shown to apply to even larger matrices (e.g. 8 × 8), at a much higher cost [5].
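The blockwise approach described above can be sketched in a few lines. The sketch assumes the standard Schur-complement form of blockwise inversion with the four 2 × 2 blocks named A, B, C and D; these names, the helper functions and the test matrix are our own illustrative choices, not code from [4] or from the processor firmware.

```python
# Illustrative sketch of blockwise inversion of a 4x4 complex matrix.
# Block names (a, b, c, d) and helper names are assumptions.

def matmul(a, b):
    """Matrix product of two list-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def sub(a, b):
    """Element-wise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def neg(a):
    """Element-wise negation."""
    return [[-x for x in row] for row in a]

def inv2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def blockwise_inverse(m):
    """Invert a 4x4 matrix using only 2x2 inversions and products."""
    a = [row[:2] for row in m[:2]]
    b = [row[2:] for row in m[:2]]
    c = [row[:2] for row in m[2:]]
    d = [row[2:] for row in m[2:]]
    a_inv = inv2(a)
    ca = matmul(c, a_inv)                      # C A^-1
    ab = matmul(a_inv, b)                      # A^-1 B
    s_inv = inv2(sub(d, matmul(ca, b)))        # (D - C A^-1 B)^-1
    top_right = neg(matmul(ab, s_inv))         # -A^-1 B S^-1
    bottom_left = neg(matmul(s_inv, ca))       # -S^-1 C A^-1
    top_left = sub(a_inv, matmul(top_right, ca))  # A^-1 + A^-1 B S^-1 C A^-1
    return ([top_left[i] + top_right[i] for i in range(2)] +
            [bottom_left[i] + s_inv[i] for i in range(2)])

m = [[4 + 1j, 1, 0.5j, 0],
     [1, 3 - 1j, 0, 0.5],
     [0.5j, 0, 5, 1 - 1j],
     [0, 0.5, 1 + 1j, 4 + 2j]]
m_inv = blockwise_inverse(m)
identity_check = matmul(m, m_inv)   # should be close to the 4x4 identity
```

A general 4 × 4 matrix needs two 2 × 2 inversions and several 2 × 2 products in this form, which is the baseline that any structure-specific simplification has to improve on.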
B. Computational Simplification for STBC Decoding
Alamouti matrix based STBC is the most widely used MIMO scheme in the latest wireless standards (e.g. 3GPP LTE and WiMAX). The structure of matrices built from 2 × 2 Alamouti sub-blocks (as shown in Eq. 1) remains invariant under several nontrivial matrix operations such as matrix inversion [6]. Besides the Alamouti structure, we observe that the matrix to be inverted, $H^H H + \sigma^2 I$, is Hermitian. By exploiting these two properties on top of the BAMI method presented in Sec. IV-A, we propose a novel simplification that significantly reduces the amount of computation needed to calculate W, as elaborated below.
For example, the channel matrix of a two-user STBC is a 4 × 4 Alamouti matrix depicted in the following:
Therefore, we have
which is a Hermitian Alamouti matrix with 
where
This computation involves only two reciprocal (1/x) operations, a few MAC operations and a few subtractions. The sign-flip operations are free with our hardware architecture. As shown in Eq. 3 and Eq. 4, after the inversion, $(H^H H)^{-1}$ is multiplied with $H^H$, which requires only 24 MAC operations owing to the structure of the matrix.
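The effect of the Hermitian and Alamouti structure can be checked numerically with the following sketch. It is our own reconstruction under stated assumptions, not the exact equations of this paper: the 2 × 2 Alamouti block is assumed to have the form [[a, b], [-b*, a*]], the channel values and noise variance are arbitrary, and all function names are hypothetical. Under that assumption, $H^H H + \sigma^2 I$ has the block form [[αI, B], [B^H, δI]] with real α and δ and an Alamouti block B, so the blockwise inverse needs only two real reciprocals.

```python
# Sketch under assumptions: inverse of a Hermitian matrix built from 2x2
# Alamouti blocks, using two real reciprocals. Names are illustrative.

def herm(m):
    """Conjugate transpose of a list-of-rows matrix."""
    return [[m[r][c].conjugate() for r in range(len(m))]
            for c in range(len(m[0]))]

def matmul(a, b):
    """Matrix product of two list-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def alamouti(a, b):
    """Assumed Alamouti block convention: [[a, b], [-b*, a*]]."""
    return [[a, b], [-b.conjugate(), a.conjugate()]]

def block4(a11, a12, a21, a22):
    """Assemble a 4x4 matrix from four 2x2 blocks."""
    return ([a11[i] + a12[i] for i in range(2)] +
            [a21[i] + a22[i] for i in range(2)])

def scaled_identity(v):
    return [[v, 0], [0, v]]

def hermitian_alamouti_inverse(g):
    """Invert g = [[alpha*I, B], [B^H, delta*I]] with B an Alamouti block."""
    alpha = g[0][0].real
    delta = g[2][2].real
    b = [row[2:] for row in g[:2]]
    beta = abs(b[0][0]) ** 2 + abs(b[0][1]) ** 2   # B^H B = beta * I
    inv_alpha = 1.0 / alpha                        # first reciprocal
    s = 1.0 / (delta - beta * inv_alpha)           # second reciprocal
    top_left = inv_alpha + beta * inv_alpha * inv_alpha * s
    k = -inv_alpha * s
    top_right = [[k * x for x in row] for row in b]
    return block4(scaled_identity(top_left), top_right,
                  herm(top_right), scaled_identity(s))

h = block4(alamouti(1 + 2j, 0.5 - 1j), alamouti(0.3j, 1 - 0.5j),
           alamouti(-0.7 + 0.2j, 0.1 + 0.9j), alamouti(2 - 1j, 0.4j))
sigma2 = 0.1                                       # arbitrary example value
g = matmul(herm(h), h)
for i in range(4):
    g[i][i] += sigma2
g_inv = hermitian_alamouti_inverse(g)
identity_check = matmul(g, g_inv)   # should be close to the 4x4 identity
```

Compared with a generic blockwise inversion, no 2 × 2 matrix inversion or matrix product remains here; the work reduces to scalar operations on α, δ and the two entries that define B.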
V. ARCHITECTURE OF THE BASEBAND PROCESSOR
In order to prove the concept of real-time symbol decoding for MIMO-OFDM, a programmable baseband processor is designed and implemented. The decoder has limited programmability to keep the design simple; nevertheless, it supplies enough flexibility to support most MIMO linear decoders (e.g. ZF and LMMSE). The processor can be programmed to exploit the Hermitian and Alamouti structure, whereas systolic array based solutions such as [7], [8] cannot.
As depicted in Fig. 2, the baseband processor contains a 4-way SIMD Floating-Point Complex Multiply and ACcumulation (FPCMAC) datapath. Fig. 3 shows the schematic of one FPCMAC. With 16 complex-valued general registers and four accumulation registers, the processor has enough storage to compute the inverse of matrices not larger than 4 × 4 with little memory overhead. For larger matrices (e.g. 8 × 8), data need to be moved between the register file and memory, which introduces some overhead. Fortunately, in most standards, only channel matrices not larger than 4 × 4 are involved.

Fig. 3. Schematic of the FPCMAC. The divider unit is not shown.
TABLE I. Instruction set (columns: Mnem, Name, Description, Cycles; operands given as Inputs and Outputs)
sabs: Cplx squared abs, c = a.r^2 + a.i^2

Owing to the simplification proposed in Sec. IV-B, the firmware implementation using the instruction set listed in Table I is straightforward. Compared to the BAMI method presented in [5], the number of operations needed for MIMO decoding is further reduced by almost half, which makes the solution in this paper by far the fastest method for matrix inversion in the literature. Based on the instruction set shown in Table I, the computational latency of matrix inversion is given in Table II. The result shows that by utilizing the Hermitian and Alamouti structures, the computation is reduced by 75% compared with the ordinary BAMI method presented in [9].
VI. NUMERICAL PRECISION
In order to measure the numerical precision of our solution, finite word-length simulations using different floating-point datatypes are carried out. The BER versus SNR curves of MMSE detection are depicted in Fig. 4 for 16-bit, 20-bit and 64-bit (IEEE double precision) floating-point datatypes. Fig. 4 shows that the BER curve of the 16-bit solution reaches an error floor when the SNR reaches 25 dB. Therefore, the 20-bit floating-point datatype is chosen for the final hardware implementation in this paper.
VII. HARDWARE IMPLEMENTATION
In order to compare the implementation cost of the symbol decoder presented in this paper with several existing solutions, the baseband processor design is synthesized using both Xilinx FPGA and ST 65 nm CMOS ASIC technologies.
A. FPGA Prototype
For the FPGA implementation, Xilinx ISE and Core Generator were used to synthesize the datapath components for the Virtex4-xc4vlx200 FPGA. The input is a 4 × 4 matrix of complex floating-point values and the output is the inverse matrix. All the basic units are generated using Xilinx Core Generator. The 20-bit implementation has 9 pipeline stages in the FPCMAC and takes 8 cycles to finish the real-valued division (1/x). Our design has also been synthesized for Virtex2; the main difference from the Virtex4 result is the clock frequency.
Note that the baseband processor in this paper is a 4-way SIMD machine that computes the W of four subcarriers in parallel, whereas the hardware in [7] and [8] only computes that of one subcarrier. As shown in Table III, compared to the latest synthesis results presented in [7], [8] and [9], our 20-bit implementation is much faster and occupies a smaller area.
B. ASIC Implementation
Table IV shows the synthesis result of the baseband processor. The ASIC implementation can easily run at 400 MHz and takes only 22 cycles to compute the inverse of four 4 × 4 complex-valued matrices, and 70 cycles in total to compute the W of four subcarriers. In other words, the computation of W for 32 subcarriers can be finished within 1.4 µs, which is only 1.4% of one symbol duration and 0.02% of the channel coherence time.
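The latency figures above follow from simple arithmetic, which the short sketch below reproduces. All numbers are taken from the text (400 MHz clock, 70 cycles per batch of four subcarriers, 32 subcarriers, 102.9 µs symbol duration, 8 ms coherence time); the variable names are ours, and the sketch assumes that batches of four subcarriers are processed back to back with no additional overhead.

```python
# Arithmetic check of the reported latency; values come from the text,
# variable names and the "no overhead between batches" assumption are ours.

clock_hz = 400e6             # ASIC clock frequency
cycles_per_batch = 70        # cycles to compute W for four subcarriers
subcarriers_per_batch = 4    # 4-way SIMD datapath
subcarriers = 32             # subcarriers assigned to the user
symbol_duration_s = 102.9e-6
coherence_time_s = 8e-3

# Ceiling division: number of 4-subcarrier batches needed.
batches = -(-subcarriers // subcarriers_per_batch)
latency_s = batches * cycles_per_batch / clock_hz

share_of_symbol = 100.0 * latency_s / symbol_duration_s      # in percent
share_of_coherence = 100.0 * latency_s / coherence_time_s    # in percent

print(f"latency: {latency_s * 1e6:.2f} us")
print(f"share of one symbol: {share_of_symbol:.2f} %")
print(f"share of coherence time: {share_of_coherence:.4f} %")
```

With these inputs the latency is 8 batches × 70 cycles at 400 MHz, which is well inside the 0.1-0.2 symbol duration budget stated in Sec. III.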
VIII. CONCLUSION
In this paper, a programmable MU-STBC decoder for MIMO-OFDM is presented. Compared with the existing solutions in Sec. VII, our implementation is the fastest and has the lowest silicon cost. Furthermore, it can be extended to the decoding of other MIMO schemes (e.g. quasi-orthogonal STBC and spatial multiplexing) without modifying the underlying hardware. In conclusion, through algorithm/hardware co-design, application-specific programmable hardware allows us to achieve both efficiency and flexibility.
