Abstract-This paper investigates the feasibility of using programmable DSP processors for MIMO-based radio systems. Several detection algorithms were evaluated, and MIPS costs for low-complexity detection and channel estimation algorithms for OFDM-VBLAST MIMO systems were calculated. Based on the MIPS cost estimation, a feasible hardware architecture was derived. The results show the feasibility of implementing MIMO radio systems on a programmable architecture.
I. INTRODUCTION
Multiple-Input-Multiple-Output (MIMO) technologies for radio communications give significant improvements in capacity and channel efficiency. They have been implemented on ASICs (Application Specific Integrated Circuits) for some systems. An ASIC implementation gives high performance for computationally demanding algorithms in specific applications. However, it does not give enough flexibility to support different requirements. Considering the various systems using MIMO technology and the convergence of multiple radio standards in a single product, it may be necessary to cover a relatively wide range of algorithms used in different standards and different systems.
According to the benefits obtained, MIMO techniques can basically be split into three groups: space-time coding (STC) to maximize spatial diversity, space division multiplexing (SDM) to increase the transmit data rate, and closed-loop systems. Space-time codes, such as space-time trellis codes (STTCs) and space-time block codes (STBCs), are schemes that encode the streams jointly and transmit them as several coded symbols simultaneously from each antenna, to protect the transmission against errors caused by channel fading, noise and interference. SDM systems instead transmit streams of independent data over different antennas to increase the data rate.
The VBLAST (Vertical Bell Labs Layered Space Time) system is a commonly used system based on the Diagonal BLAST system [1] for SDM. Here, we give an example of a VBLAST system implementation on programmable baseband DSP processors. The implementation is based on the IEEE 802.11a specification, but uses four transmit antennas and four receive antennas to reach four times the performance of IEEE 802.11a in frequency-selective fading channel conditions. We will analyze and select algorithms for detection and channel estimation of an OFDM-VBLAST system and estimate their MIPS costs. Since the implementation uses a programmable solution, the same hardware can also be used for other radio baseband applications.
The paper is organized as follows. Section II describes the basic OFDM-VBLAST system model, with the detection and channel estimation algorithms investigated. In section III, a multiple processor architecture is presented, and the MIPS cost is estimated. Finally, conclusions are drawn in section IV.
II. SYSTEM MODELING

A. Scope of the MIMO transceiver
The OFDM system will be implemented as an example in this paper. The implementation of a MIMO-OFDM system with $N_t$ transmit antennas and $N_r$ receive antennas can be carried out as shown in Figure 1 [4]. In this paper, we only consider symbol operations, hence the system starts after the mapper in the transmitter and ends with the de-mapper in the receiver.
The processing in the transmitter includes:
1) Division of data into several sub-streams.
2) Channel coding and interleaving.
3) Cyclic prefix (CP) appending.
The receiver includes several processes as follows:
1) Time and frequency synchronization.
2) FFT after removing the cyclic prefixes.
3) Channel estimation using VBLAST detection for each OFDM sub-carrier.
B. System modeling
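The purpose of system modeling is to find the MIPS costs for channel estimation and data reception. The received signal at the j-th receive antenna at the k-th tone of the n-th block can be expressed as:
$$r_j[n,k] = \sum_{i=1}^{N_t} H_{ij}[n,k]\, s_i[n,k] + w_j[n,k],$$
where $H_{ij}[n,k]$ is the channel frequency response between the i-th transmit antenna and the j-th receive antenna, $s_i[n,k]$ is the symbol transmitted from the i-th transmit antenna at the k-th tone of the n-th block, and $w_j[n,k]$ is independent and identically distributed complex zero-mean Gaussian noise with variance $\sigma^2$.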
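As an illustration of the per-tone signal model, the following minimal sketch forms the received vector for one tone from a channel matrix, a transmitted symbol vector and additive Gaussian noise. The function name `received_tone`, the storage convention `H[j][i]` (receive antenna j, transmit antenna i) and the numeric test values are assumptions made for this example only.

```python
import random

def received_tone(H, s, sigma2=0.0, rng=None):
    """Return r[j] = sum_i H[j][i] * s[i] + w[j] for one tone of one block.

    H is an N_r x N_t list of complex channel gains, s holds the N_t
    transmitted symbols, and w[j] is zero-mean complex Gaussian noise
    whose total variance is sigma2 (split evenly over real and imaginary
    parts).
    """
    rng = rng or random.Random(0)
    std = (sigma2 / 2.0) ** 0.5  # standard deviation per component
    r = []
    for row in H:
        noise = complex(rng.gauss(0.0, std), rng.gauss(0.0, std))
        r.append(sum(h * x for h, x in zip(row, s)) + noise)
    return r

# Hypothetical 2x2 channel and QPSK-like symbols, noise-free case.
H = [[1 + 0j, 0.5j], [0.2 + 0j, 1 - 1j]]
s = [1 + 1j, -1 + 1j]
r = received_tone(H, s)
```

With `sigma2` left at zero, the result is simply the matrix-vector product of the channel matrix and the symbol vector, which is convenient for checking a detector before noise is added.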
C. Data frame structure
The data frame structure for each antenna is illustrated in Figure 2, where the preamble portion has $N_t$ training symbols of length $N_I$, and the user data portion has N data symbols of length $N_k$ [4]. All the symbols have a cyclic prefix (CP) of length G to avoid inter-symbol interference, with $G > K_0$, where $K_0$ is the length of the discrete-time channel impulse response. At the receiver end, after removing the CPs and performing an $N_k$-point FFT, all the symbols become frequency-domain sequences of length $N_k$. The preamble sequences are used for channel estimation, while the data sequences are used for detection.
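The cyclic prefix handling described for the frame structure can be sketched as follows. This is a minimal illustration, not the implementation used in the paper; the helper names `add_cyclic_prefix` and `remove_cyclic_prefix` are hypothetical, and symbols are represented as plain Python lists of time-domain samples.

```python
def add_cyclic_prefix(symbol, g):
    """Prepend the last g time-domain samples of the symbol as its CP."""
    if not 0 <= g <= len(symbol):
        raise ValueError("CP length must be between 0 and the symbol length")
    return symbol[len(symbol) - g:] + symbol

def remove_cyclic_prefix(samples, g):
    """Drop the first g samples so that the FFT sees only the symbol body."""
    if not 0 <= g <= len(samples):
        raise ValueError("CP length must be between 0 and the sample count")
    return samples[g:]

# Toy example with a 4-sample symbol and a 2-sample prefix.
tx = add_cyclic_prefix([1, 2, 3, 4], 2)
rx = remove_cyclic_prefix(tx, 2)
```

For an IEEE 802.11a style symbol the same helpers would be called with a 64-sample body and a 16-sample prefix.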
D. Channel estimation algorithm
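According to the LS criterion, the temporal channel estimation is
$$h[n] = Q^{-1}[n]\, p[n],$$
where $Q[n]$ is built from the blocks $Q_{ij}[n]$ and $p[n]$ from the vectors $p_i[n]$.
The channel frequency response can then be obtained by taking the FFT of the temporal estimate:
$$H_{ij}[n,k] = \sum_{l=0}^{K_0-1} h_{ij}[n,l]\, W_{N_k}^{kl}, \qquad W_{N_k} = \exp\!\left(-\sqrt{-1}\,\frac{2\pi}{N_k}\right).$$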
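The step from a temporal channel estimate to a frequency response can be illustrated with a direct DFT of a short impulse response, zero-padded to the number of tones. This is a sketch for clarity only: it uses an O(N^2) sum rather than an FFT, and the function name `frequency_response` and the two-tap test channel are assumptions of this example.

```python
import cmath

def frequency_response(h, n_k):
    """Return H[k] = sum_l h[l] * exp(-2j*pi*k*l/n_k) for k = 0..n_k-1.

    h holds the K_0 taps of the estimated impulse response; taps beyond
    len(h) are treated as zero, which is equivalent to zero-padding h
    to n_k samples before the transform.
    """
    if len(h) > n_k:
        raise ValueError("impulse response longer than the transform size")
    return [
        sum(tap * cmath.exp(-2j * cmath.pi * k * l / n_k)
            for l, tap in enumerate(h))
        for k in range(n_k)
    ]

# Hypothetical two-tap channel evaluated on four tones.
H_freq = frequency_response([1.0, 0.5], 4)
```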
E. Optimum Training Sequences
The $2K_0 \times 2K_0$ matrix inversion for computing h[n] is computationally heavy. To simplify the equation, we can simplify the matrix Q. Assume that the modulation results in a constant-modulus signal. The training sequences from different transmit antennas are then desired to be orthogonal and shift-orthogonal. With this choice, we obtain the simplified algorithm for a quasi-static fading channel, with some acceptable performance degradation [2], using the same definitions of $Q_{ij}$, $q_{ij}$, $P_i$ and $p_i$ as in the original algorithm.
F. VBLAST Detection
Several common methods for VBLAST system detection are listed below:
1) Matched Filter (MF): MF multiplies the received signal with the conjugate transpose of the corresponding channel coefficient, and compares the result with all possible symbols. It needs the channel coefficients to be orthogonal to avoid interference from different paths.
2) Maximum Likelihood (ML): ML compares the received symbol with all possible combinations of transmitted symbols, resulting in a very high search complexity.
3) Zero Forcing (ZF): ZF multiplies the inverted channel coefficient matrix with the received signal.
4) Minimum Mean Square Error (MMSE): Based on ZF, MMSE considers the Signal-to-Noise Ratio (SNR) when multiplying the inverse matrix.
5) Successive Interference Cancellation (SIC): SIC includes 3 processes: ordering, nulling, and cancellation [3]. First, all the substreams are ordered according to signal strength, starting with the strongest signal. Second, the receiver estimates the strongest signal, nulling out the remaining weaker signals, according to some performance criterion, like ZF or MMSE. Finally, the interference of the detected signal is subtracted from the signal vector, to reduce the complexity.
6) Square Root Algorithm: Computation of the $N_t \times N_t$ matrix inverse in SIC is complex. One method to simplify this is by SR decomposition, where H is decomposed into a unitary matrix Q and an upper triangular matrix R, satisfying $H = QR$.
7) SM Algorithm: In [6] it is shown that the SR algorithm can be further simplified, and another SIC MMSE algorithm, using the Sherman-Morrison formula (denoted SM), is described. The algorithm consists of an initialization step followed by $N_t - 1$ recursions.
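The ordering, nulling and cancellation steps of SIC can be sketched with a plain zero-forcing criterion. This is a minimal illustration under stated assumptions, not the SR or SM algorithm: it recomputes a pseudo-inverse at every stage, assumes QPSK symbols with values ±1±1j, and picks the stream whose nulling vector has the smallest norm as the "strongest" one. All function names are hypothetical.

```python
def hermitian(a):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[a[r][c].conjugate() for r in range(len(a))]
            for c in range(len(a[0]))]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse(a):
    """Gauss-Jordan inversion with partial pivoting (square matrix)."""
    n = len(a)
    m = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [x / pivot for x in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [row[n:] for row in m]

def pinv(h):
    """Left pseudo-inverse (H^H H)^-1 H^H, assuming full column rank."""
    hh = hermitian(h)
    return matmul(inverse(matmul(hh, h)), hh)

def slice_qpsk(z):
    """Nearest QPSK point, assuming the alphabet {+-1 +-1j}."""
    return complex(1.0 if z.real >= 0 else -1.0, 1.0 if z.imag >= 0 else -1.0)

def zf_sic(H, r):
    """Detect all streams of r = H s + w by ZF nulling with cancellation."""
    n_r, n_t = len(H), len(H[0])
    remaining = list(range(n_t))
    r = list(r)
    s_hat = [None] * n_t
    while remaining:
        h_sub = [[H[j][i] for i in remaining] for j in range(n_r)]
        g = pinv(h_sub)
        # Ordering: smallest nulling-vector norm = least noise enhancement.
        k = min(range(len(remaining)),
                key=lambda m: sum(abs(x) ** 2 for x in g[m]))
        i = remaining.pop(k)
        # Nulling and slicing.
        s_hat[i] = slice_qpsk(sum(g[k][j] * r[j] for j in range(n_r)))
        # Cancellation of the detected stream from the received vector.
        r = [r[j] - H[j][i] * s_hat[i] for j in range(n_r)]
    return s_hat

# Noise-free check on a hypothetical 2x2 channel.
H = [[1 + 0j, 0.5j], [0.2 + 0j, 1 - 1j]]
s = [1 + 1j, -1 + 1j]
r = [sum(H[j][i] * s[i] for i in range(2)) for j in range(2)]
s_hat = zf_sic(H, r)
```

Recomputing the pseudo-inverse at every stage is exactly the cost that the SR and SM formulations aim to avoid, so this sketch should be read as a functional reference rather than an efficient implementation.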
G. Detection Algorithms Comparison
In general, the performance of the MF algorithm is too low, while ML, ZF and MMSE are complex to implement, so SIC is the most commonly used algorithm. In [6], the computation cost for three decomposition methods for implementation of SIC (Singular Value Decomposition (SVD), SR, and SM) can be found. The result is summarized in Table I, where the transmitter and the receiver have the same number of antennas, $M = N_t = N_r$.
It is apparent that SM is more efficient than the other two, so we focus on the SM algorithm.
H. Flow chart
The SM algorithm as applied after synchronization can be described as follows: The channel coefficient H is calculated by the process shown in Flow 1, then the weighting vector w is calculated, based on H, as shown in Flow 2. Finally, the original signal s is recovered using w, following the process in Flow 3.
Flow chart fragment: ... $N_t$ execute steps 9 and 12; 9) for $k_1 = 1..K_0$ execute steps 10-11; 10) ...
Flow chart fragment: ... $N_r$ execute steps 9 and 13-14; 9) for $i = 1..N_t$ execute steps 10-11; 10) ...
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the WCNC 2006 proceedings. 
III. COMPUTATION PROFILING AND COMPLEXITY ANALYSIS
Based on the previous section, the most common operations used can be summarized as 
A. Definition of Benchmarking and MIPS
MIPS cost computations have been made for a programmable DSP processor architecture optimized for baseband processing [7]. The numbers are based on the following assumptions:
1) One complex-valued MAC per cycle;
2) One radix-2 FFT butterfly per cycle;
3) Memory operations run in parallel with computing;
4) Prolog and epilog costs and other overhead are around 10% of the computing cost.
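The benchmarking assumptions can be turned into a few small helpers, shown here as a sketch. The function names are hypothetical, and the conversion to MIPS simply divides an instruction count by the available time in microseconds; the 10% overhead is applied as a multiplicative factor, which is one possible reading of assumption 4.

```python
def fft_cycles(n_points):
    """Cycles for an n-point radix-2 FFT at one butterfly per cycle.

    An n-point radix-2 FFT has log2(n) stages of n/2 butterflies each.
    """
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    stages = n_points.bit_length() - 1
    return (n_points // 2) * stages

def with_overhead(cycles, overhead=0.10):
    """Add prolog/epilog and other overhead as a fraction of the compute cost."""
    return cycles * (1.0 + overhead)

def mips(instructions, time_us):
    """Instructions per microsecond, i.e. million instructions per second."""
    return instructions / time_us

# A 64-point FFT under these assumptions.
cycles_64 = fft_cycles(64)
```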
B. Architecture Proposal
In this section, we propose a processor architecture suitable for implementing OFDM-VBLAST systems. We consider a homogeneous multi-processor system where each processor node is a DSP processor containing:
• Native complex data-paths
• A double CMAC-unit capable of performing a radix-2 butterfly per cycle
• Local memories
• A RISC control unit
• Network interface for fast data transfer
An overview of the processor architecture is presented in the accompanying figure. Several processor nodes are connected to each other and to interface blocks through an on-chip network, SoCBUS [8]. SoCBUS is a low-latency on-chip network which allows the communicating cores to set up dedicated data channels between each other.
In addition to processor nodes, the system also contains input and output nodes as well as dedicated memory nodes to complement the local memory in each processor node. Input and output nodes are responsible for the interface towards the ADC/DAC for each radio as well as the interface towards the bit manipulation part of the system. An overview of the complete system is presented in Figure 4.
C. MIPS costs for channel estimation
The computational complexity is concentrated on two steps: computing the P matrix based on the signal received, and computing the channel matrix based on the P and Q matrices.
1) Computing the P matrix:
The P matrix is computed from r and S, where r is a received symbol at one receive antenna after synchronization in the time domain, and S is the $N_t \times N_t$ preamble symbol matrix for the corresponding sub-carrier. With one radix-2 butterfly per cycle, the cycle cost of the (I)FFT for one symbol is $\tfrac{N_k}{2}\log_2 N_k$. The cost of multiplication with the $N_t$ transmitted preamble symbols stored at the receiver end is $N_t N_k$. After adding the cost of the matrix multiplication, the subtraction and the FFT, the total cycle cost for one received symbol at $N_r$ receive antennas is
$$N_r\left\{N_k\left[\left(N_t + \tfrac{1}{2}\right)\log_2 N_k + N_t\right] + K_0 N_t\left(K_0 N_t + 1\right)\right\}.$$
Given our example with four transmit antennas and four receive antennas, that is, $N_t = 4$, $N_r = 4$, $N_k = 64$, $K_0 = 10$, where $N_k$ and $K_0$ are based on experience from IEEE 802.11a [9], the total cost for channel estimation is $4\{64[(4 + \tfrac{1}{2})\log_2 64 + 4] + 10 \times 4(10 \times 4 + 1)\} = 14496$ complex-valued instructions. If these computations must be completed in one symbol time (4 µs), the corresponding cost is 3624 MIPS for reaching four times the data rate of IEEE 802.11a. Assuming the previously mentioned processor core can perform four complex operations per clock cycle at a clock frequency of 200 MHz, 800 MIPS can be mapped to each processor, so five such processors are needed to handle the computational load in this example.
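The arithmetic of this example can be checked with a short script. The closed-form expression in the function below is read off the numeric expression in the text by substituting the symbols back for their values; the function name and the use of integer arithmetic are choices made for this sketch.

```python
import math

def channel_estimation_cycles(n_t, n_r, n_k, k_0):
    """N_r * { N_k * [(N_t + 1/2) * log2(N_k) + N_t] + K_0*N_t*(K_0*N_t + 1) }."""
    if n_k < 2 or n_k & (n_k - 1):
        raise ValueError("n_k must be a power of two")
    log2_nk = n_k.bit_length() - 1
    # (N_t + 1/2) * N_k * log2(N_k), kept in integers: N_k is even.
    fft_part = (2 * n_t + 1) * (n_k // 2) * log2_nk
    multiply_part = n_t * n_k
    matrix_part = k_0 * n_t * (k_0 * n_t + 1)
    return n_r * (fft_part + multiply_part + matrix_part)

SYMBOL_TIME_US = 4.0            # one OFDM symbol time
MIPS_PER_CORE = 4 * 200         # four complex operations per cycle at 200 MHz

cycles = channel_estimation_cycles(n_t=4, n_r=4, n_k=64, k_0=10)
required_mips = cycles / SYMBOL_TIME_US
cores = math.ceil(required_mips / MIPS_PER_CORE)
print(cycles, required_mips, cores)  # 14496 3624.0 5
```

Changing `n_t`, `n_r` or `k_0` shows how quickly the matrix term grows, since it is quadratic in the product of the channel length and the number of transmit antennas.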
D. MIPS costs for detection
For initialization, the cycle cost for calculating R is $N_t N_r (N_t + 1)/2 + N_t$, for T the cost is $\tfrac{5}{2} N_t N_r (N_t + 1)$, for $w_{k_i}$ it is $N_t N_r$, and for $y_{k_i} = q_{k_i} r$ it is $N_r$. For each loop i in the $(N_t - 1)$ recursions, the cost for cancellation is $2N_r$, for computing T the cost is $(N_t - i)$, and for $y_{k_i}$ it is $N_r$.
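Considering $N_k$ sub-carriers, the total cost for one received symbol at $N_r$ receive antennas for channel compensation can be summarized as $3 N_r N_k N_t - N_r N_k$ for each data symbol, plus
$$N_k\left\{\sum_{i=1}^{N_t-1}\left[\tfrac{5}{2}(N_t-i)(N_t-i+1) + N_r(N_t-i)\right] + N_t N_r(3N_t+4) + N_t\right\}$$
for each data frame. Continuing the example, with $N_t = 4$, $N_r = 4$, $N_k = 64$, $K_0 = 10$, the total cost for channel compensation is
$$64\left\{\sum_{i=1}^{3}\left[\tfrac{5}{2}(4-i)(4-i+1) + 4(4-i)\right] + 4\times 4(3\times 4+4) + 4\right\} = 21376$$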
instructions for one data frame, plus $3 \times 4 \times 64 \times 4 - 4 \times 64 = 2816$ instructions for one data symbol. Again, these calculations must be completed in one symbol time of 4 µs. The channel compensator can be implemented with seven additional cores.
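The detection costs of the example can be reproduced with the sketch below. The per-frame expression is inferred from the numeric terms of the example, and with $N_t = N_r = 4$ the example cannot distinguish whether the factor in the second summand is $N_r$ or $N_t$; using `n_r` there is an assumption of this sketch, as are the function names.

```python
import math

def detection_frame_cycles(n_t, n_r, n_k):
    """Per-frame cost: recursion terms plus initialization, for n_k tones."""
    recursion = sum(
        5 * (n_t - i) * (n_t - i + 1) // 2   # product of consecutive ints is even
        + n_r * (n_t - i)                    # assumed factor n_r, see lead-in
        for i in range(1, n_t)
    )
    initialization = n_t * n_r * (3 * n_t + 4) + n_t
    return n_k * (recursion + initialization)

def detection_symbol_cycles(n_t, n_r, n_k):
    """Per-data-symbol cost 3*N_r*N_k*N_t - N_r*N_k."""
    return 3 * n_r * n_k * n_t - n_r * n_k

frame_cycles = detection_frame_cycles(4, 4, 64)
symbol_cycles = detection_symbol_cycles(4, 4, 64)
# Cores needed if the per-frame work must fit in one 4 us symbol time,
# with 800 MIPS available per core.
frame_cores = math.ceil((frame_cycles / 4.0) / 800)
print(frame_cycles, symbol_cycles, frame_cores)  # 21376 2816 7
```

Evaluating the per-symbol expression $3 \times 4 \times 64 \times 4 - 4 \times 64$ gives 2816 instructions, and the per-frame load of 21376 instructions in 4 µs corresponds to 5344 MIPS, which rounds up to seven cores at 800 MIPS each.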
IV. CONCLUSION
This paper has presented estimates of the MIPS costs for channel estimation and detection algorithms in an OFDM-VBLAST system, for MIMO transceiver implementations using programmable processors instead of ASICs. The final MIPS costs for each algorithm have been estimated. With the example of four transmit and four receive antennas, used to achieve four times the data rate of the IEEE 802.11a standard, the MIPS costs for the algorithms show the feasibility of such an implementation.
