A systolic array implementing lattice-reduction-aided linear detection is proposed for a MIMO receiver. The lattice reduction algorithm and the ensuing linear detection operate on the same array, which can be hardware-efficient. The all-swap lattice reduction (ASLR) algorithm is considered for the systolic design. ASLR is a variant of the LLL algorithm that processes all lattice basis vectors within one iteration. Lattice-reduction-aided linear detection based on ASLR has bit-error-rate performance very similar to that based on LLL, while ASLR is more time-efficient in the systolic array, especially for systems with a large number of antennas.
INTRODUCTION
Lattice-reduction-aided detection (LRAD), which combines lattice reduction techniques with linear detection or successive spatial-interference cancellation, has been shown to improve error-rate performance [1] [2]. In LRAD, the lattice reduction algorithm must be performed whenever the channel changes. If the channel changes rapidly, or if a large number of channel matrices must be processed, as in a MIMO-OFDM system, a high-throughput hardware structure is needed for real-time applications. To this end, we propose a systolic array to implement linear LRAD. A systolic array allows simple parallel processing and can therefore achieve higher data rates without demanding faster hardware. Hence, a systolic array may be one of the best solutions for the practical implementation of a MIMO detector.
In this paper, we consider LRAD based on all-swap lattice reduction (ASLR) instead of the most widely used LLL algorithm [3]. ASLR is a variant of LLL and was first proposed in [5] for real lattices. A complex-valued version of ASLR is presented in this paper. A crucial difference between ASLR and the LLL algorithm is that in ASLR all lattice basis vectors are processed simultaneously during a single iteration. Since ASLR was originally designed for parallel processing, a systolic array running ASLR is on average more efficient than one running LLL. After lattice reduction, linear detectors, such as zero-forcing (ZF) and minimum mean-square error (MMSE), can also be implemented by the same systolic array without any extra hardware cost.
The work of N.C. Wang was partially supported by National Science Council, Taiwan. (TMS-094-2-A-002).
The following notations are used throughout the remaining sections. Capital bold letters denote matrices, and lower case bold letters denote column vectors. 
LATTICE-REDUCTION-AIDED LINEAR DETECTION
We consider a MIMO system with $m$ transmit and $n$ receive antennas in a rich-scattering flat-fading channel. Let $\mathbf{x}$ be the transmitted M-QAM signal vector, $\mathbf{y}$ the received signal vector, and $\mathbf{H}$ the $n \times m$ channel matrix whose entries are uncorrelated, zero-mean, unit-variance complex Gaussian fading gains. The baseband model for this MIMO system is
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}, \qquad (1)$$
where $\mathbf{n}$ is the white Gaussian noise vector. Additionally, we assume that the channel matrix entries are fixed during each frame interval and that the receiver has perfect knowledge of the realization of $\mathbf{H}$. In MIMO detection, the objective of the lattice reduction algorithm is to derive, from the original channel matrix $\mathbf{H}$ and under a given criterion, a better-conditioned matrix $\tilde{\mathbf{H}}$ together with a unimodular matrix $\mathbf{T}$ such that $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{T}$ [1]. Linear LRAD combines the lattice reduction algorithm with a linear detector such as ZF or MMSE. Consider ZF first. Since $\mathbf{y} = \tilde{\mathbf{H}}\mathbf{T}^{-1}\mathbf{x} + \mathbf{n}$, the estimated signal $\hat{\mathbf{x}}$ can be written as
$$\hat{\mathbf{x}} = \tilde{\mathbf{H}}^{\dagger}\mathbf{y} = \mathbf{T}^{-1}\mathbf{x} + \tilde{\mathbf{H}}^{\dagger}\mathbf{n}. \qquad (2)$$
Let $\hat{\mathbf{x}}_q$ be a version of $\hat{\mathbf{x}}$ quantized elementwise. From (2), it is clear that $\hat{\mathbf{x}}_q$ is an estimate of $\mathbf{T}^{-1}\mathbf{x}$ rather than of $\mathbf{x}$. Hence, the last step is to transform $\hat{\mathbf{x}}_q$ back into an estimate of $\mathbf{x}$, i.e., $\hat{\mathbf{x}}_{LR} = \mathbf{T}\hat{\mathbf{x}}_q$. Equation (2) also applies to MMSE detection if the extended system model in [1] is considered: simply substitute the extended channel matrix and the extended received vector for $\mathbf{H}$ and $\mathbf{y}$, respectively. The remaining operations are the same as in ZF.
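The three ZF steps above (filter with the reduced channel, quantize, map back through $\mathbf{T}$) can be sketched in a few lines of Python. This is only an illustration under assumptions that are not from the paper: a real-valued square $2 \times 2$ channel, integer-valued symbols so that plain rounding acts as the quantizer, and a hand-picked unimodular `T` rather than one produced by a reduction algorithm. The helper names `matmul`, `inv2`, `zf` and `lrad_zf` are hypothetical.

```python
# Sketch of ZF lattice-reduction-aided detection for a square 2x2 real channel.
# Assumptions (not from the paper): integer symbols, rounding as the quantizer,
# and a hand-picked unimodular T.

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inv2(A):
    """Inverse of a nonsingular 2x2 matrix (equals the pseudo-inverse here)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def zf(H, y):
    """Plain ZF: filter with H^{-1}, then quantize elementwise."""
    x = matmul(inv2(H), [[v] for v in y])
    return [round(v[0]) for v in x]

def lrad_zf(H, T, y):
    """LRAD-ZF: filter with the reduced channel, quantize, map back with T."""
    H_red = matmul(H, T)                       # reduced basis: H~ = H T
    z = matmul(inv2(H_red), [[v] for v in y])  # estimate of T^{-1} x
    z_q = [[round(v[0])] for v in z]           # elementwise quantization
    x_lr = matmul(T, z_q)                      # x_LR = T z_q
    return [int(v[0]) for v in x_lr]

H = [[1.0, 0.9], [0.0, 0.2]]   # poorly conditioned example channel
T = [[1, -1], [0, 1]]          # unimodular: integer entries, determinant 1
x = [1, -1]                    # transmitted integer symbols
noise = [0.1, -0.095]          # one hand-picked noise sample
y = [sum(h * s for h, s in zip(row, x)) + w for row, w in zip(H, noise)]

x_zf = zf(H, y)
x_lr = lrad_zf(H, T, y)
print("plain ZF:", x_zf, " LRAD-ZF:", x_lr)
```

For this particular noise sample the plain ZF estimate is pushed across a decision boundary while the LRAD estimate is not; a single sample of course says nothing about average error rates.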
ALL-SWAP LATTICE REDUCTION ALGORITHM
δ is a constant chosen between 1/2 and 1. The process that makes the basis set satisfy (3) is called size reduction (SR). Table I describes the complex ASLR algorithm. In the following discussion, we refer to the lines in Table I. One significant difference between LLL and ASLR is that the column pairs $k$ and $k-1$ for all even (or all odd) indices $k$ can be swapped simultaneously (lines 10 and 13). For systolic arrays, all these column swaps within one iteration can be done in parallel. Additionally, unlike the LLL algorithm considered in the literature [1] [4], the size reduction process in ASLR applies to all the columns of $\mathbf{H}$ during one iteration (lines 3 to 8), and we call it "full size reduction (FSR)." The advantage of FSR over SR in our proposed systolic array will be shown in Section 4.
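A minimal Python sketch of one FSR pass followed by the even/odd swap test may help fix ideas. It does not reproduce Table I. It assumes the conditions commonly used for complex LLL, namely $|\mathrm{Re}(r_{l,k}/r_{l,l})| \le 1/2$ and $|\mathrm{Im}(r_{l,k}/r_{l,l})| \le 1/2$ for size reduction, and a swap of columns $k-1$ and $k$ when $\delta |r_{k-1,k-1}|^2 > |r_{k,k}|^2 + |r_{k-1,k}|^2$; the paper's conditions (3) and (4) may be written differently. The function names are hypothetical.

```python
def cround(z):
    """Round a complex number to the nearest Gaussian integer."""
    return complex(round(z.real), round(z.imag))

def full_size_reduction(R, T):
    """Size-reduce every column of upper-triangular R in place; mirror in T."""
    m = len(R)
    for k in range(1, m):                # columns 2..m (0-based 1..m-1)
        for l in range(k - 1, -1, -1):   # rows above the diagonal, bottom-up
            mu = cround(R[l][k] / R[l][l])
            if mu != 0:
                for i in range(l + 1):   # column l of R is zero below row l
                    R[i][k] -= mu * R[i][l]
                for i in range(len(T)):
                    T[i][k] -= mu * T[i][l]

def swap_candidates(R, delta, parity):
    """1-based indices k with k % 2 == parity whose pair (k-1, k) fails the
    assumed Lovasz-type condition and would therefore be swapped."""
    out = []
    for k in range(2, len(R) + 1):
        if k % 2 != parity:
            continue
        a, b, c = R[k - 2][k - 2], R[k - 2][k - 1], R[k - 1][k - 1]
        if delta * abs(a) ** 2 > abs(c) ** 2 + abs(b) ** 2:
            out.append(k)
    return out

# Example upper-triangular factor (arbitrary values) and identity T.
R = [[2 + 0j, 3.2 + 1.2j, -1.2 + 0.7j],
     [0j,     1 + 0j,      2.6 - 1.4j],
     [0j,     0j,          0.5 + 0j]]
R0 = [row[:] for row in R]               # copy kept to check R = R0 T afterwards
T = [[1 + 0j if i == j else 0j for j in range(3)] for i in range(3)]

full_size_reduction(R, T)
even_ks = swap_candidates(R, 0.75, 0)    # pairs tested in an "even" iteration
odd_ks = swap_candidates(R, 0.75, 1)     # pairs tested in an "odd" iteration
print(even_ks, odd_ks)
```

Because pairs with the same parity never share a column, every index returned by one call to `swap_candidates` can be swapped at the same time.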
Two minor modifications of the original ASLR algorithm are made to accommodate the systolic array design. First, the Givens rotation (lines 17~21) is executed before the column swap (line 22). This is because the Givens rotation can work in parallel with FSR, whereas the column swap cannot. This point will be made clear in Section 4. Second, the QR decomposition 
SYSTOLIC ARRAY FOR ASLR ALGORITHM

Systolic Array for FSR-LLL
In the following, we assume a $4 \times 4$ MIMO system and illustrate the proposed systolic array in three parts: full size reduction, Givens rotation, and column swap. Prior to the ASLR, QRD of the channel matrix is needed. In this paper, we assume that the matrices $\mathbf{Q}^H$ and $\mathbf{R}$ are computed by the systolic array proposed in [6].

a) Full size reduction: The systolic array for the linear LRAD is shown in Fig. 1 [...] When a systolic array is used, the advantage of FSR over SR can be shown by the following example. Suppose no column swap is necessary after $\mathbf{H}$ is size-reduced. In ASLR, no further processing is needed after FSR. Hence, the systolic array takes a total of $3m - 3$ cycles to finish all processing. However, with SR the process does not end until columns 2 to $m$ are sequentially size-reduced, and it takes [...] Table I, if there exists any $k$ such that [...] then ASLR proceeds to the Givens-rotation step. To simplify this condition check in the systolic array, we use a variant of (4) for a reduced lattice, [...]

Since the condition check now relates only two $r$ elements in neighboring diagonal cells, it can be checked in parallel with the FSR step. For example, in Table I [...]). The rotation cell simply rotates the input data by the angle given by the neighboring cell. Hence, the vectoring and rotation cells also work in a systolic way, with the rotation angle $\Theta$ propagating between cells. Note that all diagonal cells can generate the "swap" signal during the FSR step. Therefore, there is a "switch", managed by the external controller, between each pair of diagonal cell and vectoring cell. If the current value of "order" is even (odd), then the "switch" between each cell $D_{k-1,k-1}$ with even (odd) index $k$ and the vectoring cell is turned on by the external controller. Consequently, for every even (odd) index $k$, a Givens rotation between rows $k-1$ and $k$ can be executed if needed.
Additionally, a Givens rotation on rows $k$ and $k-1$ can begin right after [...] is updated during FSR, without interfering with the remaining operations of FSR. In this way, the time necessary to perform Givens rotations can be hidden by the FSR, which is why we want the Givens rotation to occur prior to the column swap in our design.

c) Column swap: If columns $k$ and $k-1$ of $\mathbf{R}$ (and $\mathbf{T}$) should be swapped, the external controller sends command signals from the top cells of columns $k$ and $k-1$ to force the data swap. The command signals propagate vertically downward along these columns. More than one pair of columns can be swapped during one iteration, but all these pairs are swapped in parallel. Hence, the time spent on column swapping is the same as for swapping a single pair of columns. The external controller can send in the command signals after full size reduction and Givens rotation have finished. However, it is still possible for the column swap to overlap partially in time with size reduction and Givens rotation.
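Rotating before swapping is possible because a Givens rotation acts on rows while a swap acts on columns, so the two operations commute. The sketch below illustrates this for several disjoint column pairs at once. It is a plain matrix-level illustration, not a model of the cells, the switches or the controller; the function name `rotate_then_swap` and the example matrix are hypothetical.

```python
import math

def rotate_then_swap(R, T, ks):
    """For each 1-based k in ks (pairs must be disjoint), first rotate rows
    k-1 and k of R, then swap columns k-1 and k of R and T."""
    m = len(R)
    for k in ks:
        i, j = k - 2, k - 1                # 0-based indices of the pair
        b, c = R[i][j], R[j][j]            # entries that move to column k-1
        norm = math.hypot(abs(b), abs(c))
        g = [[b.conjugate() / norm, c.conjugate() / norm],
             [-c / norm, b / norm]]        # unitary 2x2 rotation
        for col in range(m):
            u, v = R[i][col], R[j][col]
            R[i][col] = g[0][0] * u + g[0][1] * v
            R[j][col] = g[1][0] * u + g[1][1] * v
    for k in ks:                           # independent swaps, any order
        i, j = k - 2, k - 1
        for row in R:
            row[i], row[j] = row[j], row[i]
        for row in T:
            row[i], row[j] = row[j], row[i]

# Example 4x4 upper-triangular matrix (arbitrary values) and identity T.
R = [[2 + 0j, 1 + 1j, 0.5 + 0j, -0.3j],
     [0j,     1 + 0j, 0.2j,     0.4 + 0j],
     [0j,     0j,     1.5 + 0j, 0.6 - 0.2j],
     [0j,     0j,     0j,       0.5 + 0j]]
R0 = [row[:] for row in R]
T = [[1 + 0j if i == j else 0j for j in range(4)] for i in range(4)]

rotate_then_swap(R, T, [2, 4])             # both even-index pairs at once

def col_norm(M, c):
    """Euclidean norm of column c."""
    return math.sqrt(sum(abs(row[c]) ** 2 for row in M))
```

After the call, `R` is again upper triangular and its columns are the original columns in swapped order, up to a unitary transformation, so column norms are preserved.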
Note that in our description we limit the applications of this 
Comparison between LLL and ASLR
First, we compare the two algorithms in terms of bit-error-rate (BER) performance. In our simulation, 4-QAM is assumed for the transmitted symbols. Fig. 3(a) shows the BER results of MMSE-LRAD based on the LLL and ASLR algorithms (denoted MMSE-LLL and MMSE-ASLR, respectively). The two algorithms lead to almost the same results in all three MIMO systems. Hence, we can conclude that although LLL and ASLR give different lattice-reduced matrices, linear LRAD based on either algorithm has similar BER performance.
Next, we compare the efficiency of the systolic array for both algorithms in terms of the average number of column swaps in the overall process. Fewer column swaps imply fewer iterations, and thus fewer cycles in the systolic array. Fig. 3(b) shows the average number of column swaps in the LLL and ASLR algorithms for MMSE-LRAD. For ASLR, we count all the column swaps during one iteration as only one swap, since they are executed in parallel. As the number of antennas grows, the advantage of ASLR becomes significant. In $4 \times 4$ MIMO, the difference between the two algorithms is no more than 0.5. However, in a $16 \times 16$ MIMO system, MMSE-ASLR requires less than 62% of the column swaps of MMSE-LLL when $E_b/N_0$ is above 10 dB. Based on the BER performance and time-efficiency comparisons, ASLR is the better algorithm for our systolic array, especially with a large number of antennas.
SYSTOLIC ARRAY FOR LINEAR DETECTION
The linear-detection processes described in Section 2 can also be operated on the systolic array in Fig. 1(a). Consider ZF detection first. To execute $\hat{\mathbf{x}} = \tilde{\mathbf{H}}^{\dagger}\mathbf{y} = \mathbf{R}^{-1}\mathbf{Q}^{H}\mathbf{y}$ in the systolic array, we separate it into two matrix-vector multiplications [...] $\mathbf{Q}$, which is the same as shown in Fig. 4(a). Then $\mathbf{y}_2$ enters the array right after $\mathbf{y}_1$, also in a skewed manner, and is multiplied by $\mathbf{Q}_2$. Hence, for MMSE we need an extra operation at the output of the array, which is [...] $\mathbf{R}$ directly, the following recurrence equation [7] is considered for the systolic design [...]
According to (5), it is clear that $\mathbf{R}^{-1}\mathbf{v}$ can be computed directly from the components of $\mathbf{R}$. As shown in Fig. 4(b), the vector $\mathbf{v}$ enters the array from the right, and $\hat{\mathbf{x}} = \mathbf{R}^{-1}\mathbf{v}$ is computed by the upper-triangular array with the cell operations shown in Fig. 4(e). The output vector $\hat{\mathbf{x}}$ is then quantized elementwise outside the systolic array. The final step consists of multiplying the quantized vector $\hat{\mathbf{x}}_q$ by the unimodular matrix $\mathbf{T}$, which is very similar to the operation $\mathbf{v} = \mathbf{Q}^{H}\mathbf{y}$. Hence, the data flow in Fig. 4(c) is the same as in Fig. 4(a). The cell operations for $\hat{\mathbf{x}}_{LR} = \mathbf{T}\hat{\mathbf{x}}_q$ are shown in Fig. 4(d), and $\hat{\mathbf{x}}_{LR}$ is the final result of the linear LRAD. In sum, linear detection, be it ZF or MMSE, requires one addition, one multiplication, and one division in each diagonal cell, and one addition and one multiplication in each off-diagonal cell. These operations are already available in each cell from the LLL lattice reduction stage. Hence, linear detection adds no extra hardware (adders or multipliers) to any cell. Only extra control logic is needed so that each PE works correctly in the different modes.
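Computing $\mathbf{R}^{-1}\mathbf{v}$ without forming $\mathbf{R}^{-1}$ is commonly done by back substitution, sketched below in Python. This is the textbook recurrence and is assumed, not confirmed, to correspond to (5); the comments only indicate which arithmetic operations occur, not how they map onto the cells of Fig. 4. The function name `back_substitute` and the example values are hypothetical.

```python
def back_substitute(R, v):
    """Solve R x = v for upper-triangular R with nonzero diagonal."""
    m = len(R)
    x = [0.0] * m
    for i in range(m - 1, -1, -1):     # last unknown first
        acc = v[i]
        for j in range(i + 1, m):
            acc -= R[i][j] * x[j]      # one multiplication and one addition
        x[i] = acc / R[i][i]           # one division by the diagonal entry
    return x

# Example: build v = R x_true, then recover x_true.
R = [[2.0, 1.0, -1.0],
     [0.0, 1.0,  3.0],
     [0.0, 0.0,  4.0]]
x_true = [1.0, -2.0, 0.5]
v = [sum(r * s for r, s in zip(row, x_true)) for row in R]
x_hat = back_substitute(R, v)
print(x_hat)
```

The operation count per entry of $\mathbf{R}$ (a multiply-add off the diagonal, a division on it) is in line with the per-cell resources listed above.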
CONCLUSION
In this paper, we proposed a systolic array to perform lattice-reduction-aided linear detection for MIMO receivers. The design is based on the all-swap complex lattice-reduction algorithm, which generalizes the one originally proposed in [5] for real lattices. Compared to the LLL algorithm, ASLR operates on the whole matrix, rather than on single columns, during the column-swap and Givens-rotation steps. The linear detection can also be implemented on the same systolic array used for ASLR. Owing to the high-throughput property of systolic arrays, our design appears very promising for high-data-rate systems such as MIMO-OFDM.
