Abstract This paper describes a pipelined parallel algorithm for the MMSE-OSIC decoding procedure proposed in V-BLAST wireless MIMO systems, for heterogeneous networks of processors. It is based on a block version of the square-root Kalman Filter algorithm that was initially devised to solve the RLS problem. It has been parallelized in a pipelined way obtaining a good efficiency and scalability. The optimum load balancing for this parallel algorithm is dynamic, but we derive a static load balancing scheme with good performance.
Introduction
Multiple Input Multiple Output (MIMO) systems have been extensively studied in the context of wireless communications in recent years. The original proposal by Foschini [1] , known as BLAST (Bell Labs Layered Space-Time), has generated a family of architectures that uses multiple antenna arrays to transmit and receive information, with the goal of increasing the capacity and reliability of the links. We focus our interest on the suboptimal but practical V-BLAST family where nearly optimal decoders such as MMSE (Minimum Mean Square Error Estimation) and its ordered version OSIC (Ordered Successive Interference Cancelation), [2, 3] , can be used. There are applications that use these methods to decode information in a digital receiver at a higher rate or with less bit error ratio than in SISO (Single Input Single Output) channels [4, 5] , where the dimension of the problem may be several thousands, i.e. multicarrier systems such as OFDM used in DVB-T (Digital Video Broadcasting-Terrestrial), WIFI 802.11a, etc.
This paper describes the parallelization of the algorithm that solves the MMSE-OSIC decoding problem for a heterogeneous network of processors. The interest of the heterogeneous perspective in these applications is focused on future experimentations with GPUs where the CPU (with less computing power) collaborates in the problem solution. The algorithm is based on the ideas of [2] and can be used with the improvement presented in [3] .
The paper is organized as follows: first we will describe the MMSE-OSIC decoding algorithm; next, we will describe the pipelined parallelization of this algorithm explaining the details of the proposed load balancing scheme; and finally, results of the experimentation in a heterogeneous parallel system will be stated.
MMSE-OSIC decoding procedure
In a basic approach, we need to solve the typical perturbed system y = Hx + v, where the known full-rank matrix H ∈ C m×n , m ≥ n, represents the channel matrix and any manipulation of the symbols before transmission; x is a vector whose components belong to a discrete symbol set, and v is the process noise. Let us define H α as the augmented channel matrix:
The use of MMSE (Minimum Mean Square Error estimation) yields [2]
where,x is the estimation of x, · denotes the mapping or slicing of the result in the discrete symbol set, H †|m α denotes the first m columns of the pseudoinverse of the augmented channel matrix (1), α −1 is a signal-to-noise ratio, and the asterisk superscript (·) * denotes the complex conjugate. In OSIC, the signal components x i , i = 1, . . . n, are decoded from the strongest one (with the highest signal-to-noise ratio) to the weakest one, canceling the contribution of the decoded signal component to the received signal, and then repeating the process with the remaining signal components [2, 6] .
Every time the maximum signal-to-noise ratio remaining component is decoded, it is necessary to compute the pseudoinverse. In [2, 6] it is shown how to avoid this recomputation and how to optimize the way to obtain every component. Hence, we can obtain:
where, P 1/2 is a square-root factor of the error estimation covariance matrix P = E{(x −x)(x −x) * } = (αI n + H * H) −1 , and Q α ∈ C m×n are the first m rows of the orthogonal matrix Q from the augmented channel matrix QL-factorization, H α = QL. The P 1/2 and Q α matrices are the only necessary data to compute in order to obtain the result of the combined MMSE-OSIC algorithm. We can use the squareroot Kalman Filter algorithm [7] in order to obtain these matrices [2, 6] .
The square-root Kalman Filter for MMSE-OSIC
From (3), Q α = H †|m * α P − * /2 . This matrix is propagated throughout the iterations of the square-root Kalman Filter that was initially devised to solve an RLS (Recursive Least Squares) problem. Next, we reproduce a block version of the algorithm for MMSE-OSIC that was reported in [2] , which we refer to as SRKF-OSIC.
Calculate (i) and apply it in such a way that:
where,
e,(i) ; q is the number of consecutive rows of H processed in a block, thus H i ∈ C q×n is the ith block of q consecutive rows of H; the iteration index subscript enclosed between parentheses denotes that the variable is updated iteratively, i.e., Q α and P 1/2 are the values Q α,(i+1) and P (i) are variables of the Kalman Filter whose meaning is described in [7] , and H †|m α,(i+1) appears implicitly in Z (i) . The cost of one iteration is a matrix multiplication H i P
1/2
(i) and the application of a sequence of Givens rotations (i) , exploiting and maintaining the triangular structure of P 1/2 (i) throughout the iterations. The total cost of the iterations is W seq (m, n, q) = 4n 2 m + 3nm 2 flops approximately.
Parallel algorithm
A matrix enclosed within square brackets with a processor subscript denotes that part of the matrix belongs to that processor. If it is enclosed within parentheses, then it denotes that the entire matrix is in the processor of the subscript.
The last n columns of either E (i) or F (i) matrices of the SRKF-OSIC algorithm will be distributed among the p processors: P j will own n j columns beginning at c 0 j with n = p−1 j =0 n j . The first q columns of either E (i) or F (i) will be denoted by
will be manipulated in a pipelined way by all the processors, so we denote it as (D (i) ) P j , where P j represents the processor that is processing it. D (i) is made up of a lower triangular matrix J (i) ∈ C q×q and two general matrices M (i) ∈ C n×q and N (i) ∈ C m×q . M (i) will be divided into p groups of rows, (the number of rows will be indicated by a left superscript). The nonzero rows of N (i) will depend on the iteration index i.
Processor tasks
When the P j processor obtains zeroes in [H i P 1/2 (i) ] P j for the ith iteration, the ma-
(i+1) ] P j and begin the process for the (i + 1)th iteration. The pipelined behavior of the algorithm is based on this fact.
As an example, let us suppose that we have p = 3 processors: P 0 , P 1 and P 2 . In P 0 , the first step in the pipelined parallel algorithm is presented below (an apostrophe denotes the updating of the variable):
where i,P 0 is a unitary transformation calculated and applied by P 0 in order to obtain zeroes in [H i P 1/2 (i) ] P 0 . Now, P 0 must transfer the nonzero part of (D (i) ) P 0 to P 1 . If P 0 has received H i+1 , it has all the necessary data to make up [H i+1 P 1/2 (i+1) ] P 0 and start again. When P 1 receives (D (i) ) P 0 , then:
The comments given above for P 0 also hold for P 1 . When P 2 receives the nonzero part of (D (i) ) P 1 , then:
The comments given above for P 0 and P 1 also hold for P 2 . (See Fig. 1 which depicts the behavior of the parallel algorithm when P 0 is processing the ith iteration.)
The arithmetic cost in the ith iteration in the P j processor is due to the matrix multiplication [H i P 1/2 (i) ] P j and to the zeroing in the n j columns of [H i P 1/2 (i) ] P j (columnwise and from right to left). We can verify that the parallelization arithmetic overhead is null.
Load balancing in heterogeneous networks of processors
The perfect load balance is obtained when all the processors require the same amount of time to process their assigned workload in certain time interval. In the pipelined algorithm each processor processes a different iteration at the same time instant. It is difficult to get an analytical expression to balance the load with this consideration. We can consider a simpler criterion to balance the workload:
where t w j and t w k are the time per flop in P j and P k respectively. The n j , ∀j , values can be obtained by solving the following ideal equation:
where, S max (p, n, q) is the maximum speed-up attainable in the parallel system and t w seq is the time per flop for the sequential algorithm. The maximum speed-up in the heterogeneous network depends on the time per flop t w j of each processor. Let us define s j as the normalized relative speed of the processors (dimensionless):
We can verify that p−1 j =0 s j = 1, and t w j s j = t w k s k , ∀ j, k. We can also verify that if P j is u times faster than P k , (t w j = t w k /u), then s j = us k .
Let us suppose that the sequential algorithm is run on the fastest processor of the heterogeneous network, say P f , 0 ≤ f ≤ p − 1; then t w seq = t w f . The maximum speed-up can be obtained from (5) when the load balance has got (4) and there is no parallel arithmetic overhead (Sect. 3.1). Let us solve S max (p, n, q) from (5) with j = f :
Hence, the maximum speed-up in the heterogeneous network is the inverse of the normalized relative speed of the fastest processor. The (sub)optimal n j values depend on the i iteration value, so the proposed load balancing scheme is not static (hence, we should change the column distribution at every iteration). One possible solution to obtain a suboptimal but static load balancing scheme is to get it for the worst case, where the work load and the unbalancing are the highest (at the last iteration: i = m/q − 1). Hence, we will begin by obtaining n p−1 with c 0 p−1 = 1, then n p−2 with c 0 p−2 = c 0 p−1 + n p−1 , and so on.
Communication analysis
The data that P j must transfer to P j +1 for its ith iteration (see Fig. 1 ) is the nonzero part of (D (i) ) P j and the H i submatrix. Let us suppose that a linear model fits this communication time. We can consider two communication network models: one in which the transfer between adjacent processors can be made simultaneously (model A, i.e., a linear array topology) and another in which these transfers must be done serially (model B, i.e., bus topology). Hence, the total communication time for both models is (mn) + (m 2 /q) and (mnp) + (m 2 /qp) respectively.
Scalability analysis
We obtained a null arithmetic overhead in the parallelization, so the overhead time in the parallel algorithm is only due to the communication time. The serial time must be compared with the total communication time overhead in order to get the scalability of the parallel system [8] . In our case, the scalability ranges between n, m = (p) and n, m = (p 2 ) for the A and B models respectively. Hence, a linear scalability behavior can be obtained with just a linear array topology.
Experimental results
The heterogeneous system used to test the parallel algorithm consisted of:
• Node 0: A monoprocessor with a 1.7 GHz Pentium IV with 1 GB of main memory, running a CentOS 4.4 32-bit operating system; • Node 1: a CC-NUMA multiprocessor with two 1.4 GHz two-core Itanium processors with 8 GB of memory running a Red Hat Enterprise Linux AS 64-bit operating system; Taking as reference times the execution time of the sequential algorithm in both kinds of processors, the normalized relative speed of the processors (6) were s 0 ≈ 0.04 and s 1 = · · · = s 4 ≈ 0.24, for m = 6000 and q = 20 and independent from n. The maximum normalized relative speed in this system is max{s j } ≈ 0.24, then S max ≈ 4.2. Hence, the maximum attainable efficiency is E max = S max /p = 4.2/5 ≈ 83%. Figure 3 depicts the efficiency of the parallel algorithm. For high values of n, the efficiency is about 60-65%, which represents a relative efficiency of 72-78% with respect to the maximum.
Conclusions
We have proposed a pipelined parallel algorithm to solve the heaviest part of the MMSE-OSIC decoding problem based on the square-root Kalman Filter algorithm. All the processes derived from the parallel algorithm are regular, so the execution in a heterogeneous network implies that the load must be distributed evenly according to the speed of the processors. Although the ideal load balancing scheme for our parallel algorithm is dynamic, the behavior of the proposed static load balancing scheme in the heterogeneous system was satisfactory, with good efficiency results near to optimum values. The algorithm parallelization can be used in MIMO detection appli- cations with GPU, FPGA or other VLSI systems implementations due to its implicit regularity.
