Abstract-This paper presents the field-programmable gate array (FPGA) implementation of a variant of the Lenstra-Lenstra-Lovász (LLL) lattice reduction (LR) algorithm, known as Clarkson's Algorithm (CA), and its application to uncoded multiple input-multiple output (MIMO) detection. The CA provides practically the same performance as the LLL algorithm while having a considerably lower complexity, especially for MIMO systems with a large number of transmit and receive antennas. The algorithm has been implemented in real time using a rapid prototyping methodology, greatly reducing its development time. Implementation results indicate that the variable complexity and the sequential nature of LR algorithms, like the CA, remain their main drawbacks from an implementation point of view.
I. INTRODUCTION
Lattice reduction (LR) has been proposed in the context of uncoded detection of spatially-multiplexed multiple input-multiple output (MIMO) systems as a means of improving the performance of sub-optimal detectors [1]. This is achieved by preprocessing the MIMO channel with an LR algorithm, which transforms the channel into a more orthogonal equivalent one, and applying a linear or successive interference cancellation (SIC) detector to that new channel [2]. In particular, these LR-aided detectors have been shown to achieve the same diversity as the maximum likelihood detector (MLD) [3]. Although a number of LR methods exist in the literature with different levels of performance and complexity [4], [5], only recently have those algorithms been studied from an implementation point of view [6]-[8].
This paper presents a field-programmable gate array (FPGA) implementation of Clarkson's Algorithm (CA) presented in [9], a variant of the popular Lenstra-Lenstra-Lovász (LLL) LR algorithm. The LLL algorithm approximates the optimal performance of the Korkine-Zolotareff (KZ) algorithm while having a polynomial average complexity in the dimension of the lattice in uncorrelated Rayleigh fading environments [10]. On the other hand, the CA provides practically the same performance but reduces the complexity of the LLL algorithm by modifying the reduction criterion and eliminating the intermediate size reduction steps [9]. Although the prototyping results of the CA do not match those of the optimized very large-scale integration (VLSI) implementation of the LLL algorithm in [8], the lower complexity of the CA makes it the most promising LR algorithm from a practical point of view.
A. Lattice Reduction-Aided Detection
We consider a spatially-multiplexed MIMO system with $M$ transmit and $N$ receive antennas, denoted as $M \times N$. The vector of received symbols $\mathbf{y} \in \mathbb{C}^{N \times 1}$ can be modelled as

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{v}, \tag{1}$$

where $\mathbf{s} \in \mathbb{C}^{M \times 1}$ denotes the vector of transmitted symbols taken independently from a quadrature amplitude modulation (QAM) constellation $\mathcal{O}$ of $P$ points with $E[|s_i|^2] = 1/M$, for $1 \le i \le M$, and where $\mathbf{v} \in \mathbb{C}^{N \times 1}$ is the vector of independent complex Gaussian noise samples. The channel matrix $\mathbf{H} \in \mathbb{C}^{N \times M}$ has independent elements $h_{j,i} \sim \mathcal{CN}(0, 1)$, for $1 \le j \le N$ and $1 \le i \le M$, representing a wireless propagation environment with uncorrelated Rayleigh fading. We assume that the channel is perfectly known at the receiver and that $N \ge M$.
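The columns of the channel matrix $\mathbf{H}$ in (1), $\mathbf{h}_i$ for $1 \le i \le M$, can be seen as a generator basis of an $M$-dimensional complex lattice $\mathcal{L}(\mathbf{H}) \subset \mathbb{C}^{N \times 1}$, where the lattice is defined as the set of all complex integer combinations of the generator basis, i.e.

$$\mathcal{L}(\mathbf{H}) = \left\{ \sum_{i=1}^{M} z_i \mathbf{h}_i \;\middle|\; z_i \in \mathbb{CZ} \right\},$$

where $\mathbb{CZ} = \mathbb{Z} + j\mathbb{Z}$ denotes the set of complex integers.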
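The main idea behind LR-aided detectors is to obtain a reduced (i.e. more orthogonal) generator basis $\tilde{\mathbf{H}}$ for the same lattice $\mathcal{L}$ in order to improve the performance of sub-optimal detectors [1]. Two matrices $\mathbf{H}$ and $\tilde{\mathbf{H}}$ generate the same lattice, $\mathcal{L}(\mathbf{H}) = \mathcal{L}(\tilde{\mathbf{H}})$, if they can be written as $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{T}$, where $\mathbf{T} \in \mathbb{CZ}^{M \times M}$ is a unimodular matrix with determinant $\det(\mathbf{T}) = \pm 1$ [4]. The system model in (1) can then be rewritten as $\mathbf{y} = \tilde{\mathbf{H}}\mathbf{x} + \mathbf{v}$, where $\mathbf{x} = \mathbf{T}^{-1}\mathbf{s}$. Thus, sub-optimal detectors can be applied to this system model in order to obtain an estimate of $\mathbf{x}$, denoted $\hat{\mathbf{x}}$, before calculating an estimate of the transmitted vector $\mathbf{s}$, denoted $\hat{\mathbf{s}}$, using the relationship $\hat{\mathbf{s}} = \mathbf{T}\hat{\mathbf{x}}$. A detailed description of the operation of LR-aided detectors can be found in [11].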
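The detection chain described in this subsection can be illustrated with a minimal Python sketch of an LR-aided zero-forcing detector. It is only an illustration under simplifying assumptions: the function name `lr_aided_zf` is hypothetical, the transmitted symbols are assumed to be complex integers already (the shift and scaling that map a normalized QAM constellation onto the integer grid, and the final slicing back to the constellation, are omitted), and the unimodular matrix T is taken as given rather than computed by an LR algorithm.

```python
import numpy as np

def lr_aided_zf(y, H, T):
    """Zero-forcing detection in the reduced domain (illustrative sketch).

    Assumes the entries of s are complex integers and T is unimodular.
    """
    H_red = H @ T                            # reduced basis, generates the same lattice
    x_zf = np.linalg.pinv(H_red) @ y         # linear ZF estimate of x = T^{-1} s
    # Quantise real and imaginary parts separately to the complex integers
    x_hat = np.round(x_zf.real) + 1j * np.round(x_zf.imag)
    return T @ x_hat                         # s_hat = T x_hat
```

In the noiseless case the sketch returns the transmitted vector exactly, since T^{-1}s is itself a vector of complex integers when T is unimodular.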
II. CLARKSON'S ALGORITHM
The CA has been proposed in [9] as a means of reducing the complexity of the LLL algorithm while providing a similar performance.
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE ICC 2009 proceedings 978-1-4244-3435-0/09/$25.00 ©2009 IEEE
Both algorithms transform an input basis $\mathbf{H} = \mathbf{Q}\mathbf{R}$ into a more orthogonal reduced basis $\tilde{\mathbf{H}} = \tilde{\mathbf{Q}}\tilde{\mathbf{R}}$, with the only two algorithmic differences being [9]:
1) The use of the simpler Siegel reduction condition.
2) The elimination of the intermediate size reduction operations that take place in the LLL algorithm. The size reduction operations in the CA are performed at the end of the algorithm, outside the main loop, except for a single size reduction operation performed before the column exchange and Givens rotation.
The CA is reproduced in Algorithm 1, using MATLAB notation, where the two aforementioned differences can be observed. In order to compare their performances, Fig. 1 shows the cumulative distribution function (CDF) of the natural logarithm of the condition number of $\tilde{\mathbf{H}}$, defined as $\ln(\kappa(\tilde{\mathbf{H}})) = \ln(\|\tilde{\mathbf{H}}^{-1}\| \, \|\tilde{\mathbf{H}}\|)$, where $\|\cdot\|$ denotes the Euclidean norm operator, for each of the algorithms. For a 4 × 4 system, the distribution of $\ln(\kappa(\tilde{\mathbf{H}}))$ is the same for both algorithms, while a very small degradation appears for the CA in an 8 × 8 system.
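The two differences listed above can be sketched in Python as follows. This is not the listing of Algorithm 1 from [9]; it is a minimal illustration written from the description in the text, and several details are assumptions: the function names `clarkson_sketch` and `log_cond`, the Siegel test in the form |r_{k,k}|² ≥ (δ − 1/2)|r_{k−1,k−1}|² with δ = 3/4, and the rounding of real and imaginary parts separately. The single size reduction before each column exchange is kept, and all remaining size reductions are deferred to the end.

```python
import numpy as np

def _cround(z):
    """Round real and imaginary parts separately (nearest complex integer)."""
    return np.round(z.real) + 1j * np.round(z.imag)

def clarkson_sketch(H, delta=0.75):
    """LLL-type reduction with a Siegel-style test and deferred size reduction."""
    H = np.asarray(H, dtype=complex)
    M = H.shape[1]
    Q, R = np.linalg.qr(H)                 # H = Q R, R upper triangular
    T = np.eye(M, dtype=complex)
    c = delta - 0.5                        # assumed Siegel constant (complex case)
    k = 1                                  # 0-based index of the column under test
    while k < M:
        if abs(R[k, k]) ** 2 < c * abs(R[k - 1, k - 1]) ** 2:
            # Single size reduction of column k against column k-1
            mu = _cround(R[k - 1, k] / R[k - 1, k - 1])
            R[:k, k] -= mu * R[:k, k - 1]
            T[:, k] -= mu * T[:, k - 1]
            # Column exchange in R and T
            R[:, [k - 1, k]] = R[:, [k, k - 1]]
            T[:, [k - 1, k]] = T[:, [k, k - 1]]
            # Givens rotation restoring the upper-triangular form of R
            a, b = R[k - 1, k - 1], R[k, k - 1]
            n = np.sqrt(abs(a) ** 2 + abs(b) ** 2)
            G = np.array([[np.conj(a), np.conj(b)], [-b, a]]) / n
            R[k - 1:k + 1, :] = G @ R[k - 1:k + 1, :]
            R[k, k - 1] = 0.0              # remove rounding residue
            Q[:, k - 1:k + 1] = Q[:, k - 1:k + 1] @ G.conj().T
            k = max(k - 1, 1)
        else:
            k += 1
    # Deferred size reduction, outside the main loop
    for k in range(1, M):
        for l in range(k - 1, -1, -1):
            mu = _cround(R[l, k] / R[l, l])
            R[:l + 1, k] -= mu * R[:l + 1, l]
            T[:, k] -= mu * T[:, l]
    return Q, R, T

def log_cond(A):
    """Natural logarithm of the 2-norm condition number."""
    return np.log(np.linalg.cond(A))
```

Because the Siegel test involves only the diagonal elements of the triangular factor, it can be evaluated before any size reduction has been applied; the reduced basis is recovered as `Q @ R`, which equals `H @ T`, and `log_cond(Q @ R)` gives the metric plotted in Fig. 1.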
In addition, Table I shows the average number of times the more computationally intensive sections of the algorithms are executed in a 4 × 4 and an 8 × 8 system if the QR and the sorted QR (SQR) decompositions are used. The sections considered, with line numbers referring to Algorithm 1, are:
• The reduction criterion check (crit.) in line 4.
• The computation of μ (coef.) in lines 5 and 21.
• The size reduction (red.) in lines 7, 8, 23 and 24.
• The Givens rotation (rot.) in lines 11-13.
The definition of these sections allows the two algorithms to be easily compared from a hardware complexity point of view, even though each section has a different effect on the overall complexity of the algorithms. All sections are present in both algorithms in exactly the same form (except for the reduction criterion check, which is simpler in the CA). Additionally, each section can be mapped to a specific hardware block, so that the overall hardware complexity of both algorithms is almost equivalent. Finally, the number of executions gives a rough estimate of the LR speed for each of the algorithms.
Based on the above and looking at the results in Table I, it can be seen that the CA has a lower complexity than the LLL algorithm in all cases, especially for the 8 × 8 system. This reduced complexity translates into a higher LR speed, which in turn leads to a more efficient LR implementation. The same trend has been observed when looking at the standard deviation of the number of executions (not shown). From Table I it can also be observed how the use of the SQR, which iteratively minimizes the diagonal elements of R as they are computed [11], can reduce the overall complexity of both algorithms (only the red. section of the CA in an 8 × 8 system increases marginally when the SQR is used).
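The sorting idea behind the SQR can be sketched as a modified Gram-Schmidt QR decomposition in which, at every step, the remaining column of smallest norm is processed next. The code below follows the commonly described form of the sorted QR decomposition; whether it matches the variant used in [11] step by step is an assumption, and the function name `sorted_qr` is hypothetical.

```python
import numpy as np

def sorted_qr(H):
    """Sorted QR sketch: returns Q, R, perm such that H[:, perm] = Q @ R."""
    H = np.asarray(H, dtype=complex)
    M = H.shape[1]
    Q = H.copy()
    R = np.zeros((M, M), dtype=complex)
    perm = np.arange(M)
    for i in range(M):
        # Pick the remaining column with the smallest residual norm
        norms = np.sum(np.abs(Q[:, i:]) ** 2, axis=0)
        j = i + int(np.argmin(norms))
        Q[:, [i, j]] = Q[:, [j, i]]
        R[:, [i, j]] = R[:, [j, i]]
        perm[[i, j]] = perm[[j, i]]
        # Normalise it; its norm becomes the next diagonal element of R
        R[i, i] = np.linalg.norm(Q[:, i])
        Q[:, i] /= R[i, i]
        # Remove its contribution from the columns still to be processed
        for l in range(i + 1, M):
            R[i, l] = np.vdot(Q[:, i], Q[:, l])
            Q[:, l] -= R[i, l] * Q[:, i]
    return Q, R, perm
```

Choosing the smallest column first keeps the leading diagonal elements of R small, so the triangular factor handed to the LR algorithm tends to be closer to a reduced one.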
III. FPGA IMPLEMENTATION

A. Platform and Methodology
An FPGA-based rapid prototyping system has been used for the implementation of the CA, providing the flexibility required to move quickly from a computer-based simulation to its hardware implementation. Development has been performed for Digilent's XUP-V2P board hosting a Xilinx Virtex-II-Pro FPGA (XC2VP30). The rapid prototyping methodology selected is based on The MathWorks' MATLAB and Simulink and Xilinx's DSP System Generator. Initially, MATLAB is used to implement a complete MIMO system including transmitter, channel and receiver. The CA is then implemented on the FPGA using the DSP System Generator. The development of the FPGA model is embedded in a Simulink testbench that facilitates the debugging of the algorithm during the development stage, with the possibility of monitoring every signal in the FPGA model. The CA design is then synthesized for the FPGA using Xilinx's synthesis tools. The main advantage of this rapid prototyping platform and methodology is that it allows us to perform real-time hardware-in-the-loop testing of the algorithm without requiring any knowledge of hardware description languages (HDLs).
B. FPGA Architecture
The CA has been implemented for a 4 × 4 system in order to test its real-time suitability for LR-aided MIMO detection. The blocks of the architecture in Fig. 2 include:
• Givens Rotation: performs the Givens rotation on $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{R}}$ together with the column exchange of $\tilde{\mathbf{R}}$ and $\mathbf{T}$. It corresponds to the rot. algorithm section.
• Subset Mux: this multiplexer, depending on the stage of the algorithm and the result of the reduction criterion, selects the appropriate subsets to be integrated back into the full matrices.
• Subset Merge: merges the modified subsets of $\tilde{\mathbf{Q}}$, $\tilde{\mathbf{R}}$ and $\mathbf{T}$ back into the full matrices to be stored by the Internal Memory block.
In general, the white and light-grey blocks in Fig. 2 play a fundamental storage role, making extensive use of the FPGA flip-flops (FFs). On the other hand, the dark-grey blocks are more computationally intensive. The only exception is the State Machine block, whose complexity is functional, since it must generate the valid control signals depending on the algorithm state, rather than a matter of hardware resources (most of the signals in that block are booleans or small integers). In addition, the complexity of the Reduction Criterion block is lower than that of the same block in an LLL implementation, thanks to the simpler Siegel reduction condition.
Thus, the most computationally intensive blocks are the Size Reduction and Givens Rotation blocks. Even though the main aim of this work was to obtain a proof-of-concept implementation of the CA using a rapid prototyping methodology, a number of measures have been taken to reduce the complexity of those two blocks without compromising the development methodology. In particular, both blocks require divider architectures. In order to reduce the effect the dividers would have on the overall resource use and the latency of the design, only one divider has been used in the entire design, being reused by both blocks. In addition, the division operations have been transformed into multiplications by the inverse, to reduce the number of divisions performed. On the other hand, an off-the-shelf Xilinx divider block has been used to reduce the development time. The single square root operation required in the Givens Rotation block is also performed by an off-the-shelf block.
Another critical aspect of the Size Reduction and Givens Rotation blocks is the reuse of multipliers in order to reduce the complexity of the design. In the Size Reduction block, one complex multiplier is used for all the size reduction operations on $\tilde{\mathbf{R}}$, while another complex multiplier is used for the operations on $\mathbf{T}$. This setup represents a trade-off between the use of multipliers and the latency of the design. In the Givens Rotation block, two complex multipliers and two real-complex multipliers run in parallel to perform the Givens rotation operations on both $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{R}}$ sequentially. Due to the different fixed-point precision used for the elements of $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{R}}$, the values at the input/output of the multipliers had to be converted to/from integer values with no fractional bits so that the same multipliers could be used for both matrices. Overall, the steps described above have helped reduce the resource use of the CA implementation while exclusively using the graphical interface provided by the DSP System Generator.
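The conversion to integers with no fractional bits can be illustrated with a short Python sketch of a format-agnostic multiplier. Everything here is a simplified assumption rather than the actual System Generator design: the helper names are hypothetical, only real values are handled (the design uses complex and real-complex multipliers), and the 13 fractional bits assigned to the rotation coefficient are chosen purely for illustration.

```python
def to_raw(x, frac_bits):
    """Reinterpret a fixed-point value as an integer with no fractional bits."""
    return int(round(x * (1 << frac_bits)))

def shared_mult(a_raw, b_raw):
    """Plain integer multiplier, standing in for one shared hardware multiplier."""
    return a_raw * b_raw

def from_raw(p_raw, frac_bits):
    """Restore the binary point after the integer multiplication."""
    return p_raw / (1 << frac_bits)

COEF_FRAC = 13  # assumed fractional bits of the rotation coefficient (hypothetical)

def scale_element(x, x_frac, coef):
    """Multiply an element with x_frac fractional bits by a coefficient."""
    p = shared_mult(to_raw(x, x_frac), to_raw(coef, COEF_FRAC))
    # The product of the two raw integers carries x_frac + COEF_FRAC fractional bits
    return from_raw(p, x_frac + COEF_FRAC)
```

The same `shared_mult` serves an operand with 13 fractional bits and one with 9, because the binary point is only tracked outside the multiplier.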
IV. RESULTS
This section shows the FPGA implementation results of the CA in terms of hardware resource use, bit error rate (BER) performance of CA-aided detectors and LR speed. The resource use of the CA implementation is summarized in Table II. Initially, it can be observed how the multipliers are the least used resource due to the multiplier reuse described in the previous section. On the other hand, the slices are the most used resource, although this result can be modified by changing the settings of the place and route tools. Each slice contains two FFs and two look-up tables (LUTs) and the use of those resources is below 40%, indicating that a considerable percentage of slices are only partially used. The random access memory (RAM) blocks are used mostly by the input and output buffers required to perform a number of LRs in every hardware co-simulation run. The results in Table II have been obtained considering a wordlength of 14 bits for the elements of $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{R}}$ and 9 bits for the elements of $\mathbf{T}$.
Fig. 3 shows the BER performance of CA-aided MIMO detectors in a 4 × 4 system using 4-, 16- and 64-QAM modulations as a function of the signal-to-noise ratio (SNR) per bit.
The FPGA results have been obtained by running the MIMO system in MATLAB and performing the LRs in real time on the prototyping platform. The fixed-point precision of the design has been adjusted so that the FPGA performance practically matches the floating-point MATLAB performance if 4-QAM modulation is used. This resulted in the following fixed-point formats: $\tilde{\mathbf{Q}}$(1.13), $\tilde{\mathbf{R}}$(5.9) and $\mathbf{T}$(9.0), where the first and second numbers indicate the number of integer and fractional bits, respectively. Lastly, both a zero-forcing (ZF) linear detector and a SIC detector have been simulated. It can be seen how the detectors using the FPGA CA match the performance of those using a MATLAB CA in the 4-QAM case. However, a performance degradation appears as we increase the modulation order, an aspect often not reported in the literature. This is due to the constellation points getting closer to each other as the modulation order increases.
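The fixed-point formats above can be emulated in floating point with a small quantizer, which is one way to reproduce this kind of precision study in simulation. The sketch assumes signed two's complement values with the sign bit counted among the integer bits (consistent with the 14-bit and 9-bit wordlengths quoted in this section), round-to-nearest quantization and saturation on overflow; the rounding and overflow modes of the actual design are not stated in the text, and the names `quantize` and `FORMATS` are hypothetical.

```python
import numpy as np

# (integer bits, fractional bits) as quoted in the text
FORMATS = {"Q": (1, 13), "R": (5, 9), "T": (9, 0)}

def quantize(x, int_bits, frac_bits):
    """Quantize real and imaginary parts to a signed fixed-point grid."""
    step = 2.0 ** -frac_bits
    hi = 2.0 ** (int_bits - 1) - step      # largest representable value
    lo = -(2.0 ** (int_bits - 1))          # smallest representable value

    def q(v):
        # Round to the nearest grid point, then saturate (assumed overflow mode)
        return np.clip(np.round(v / step) * step, lo, hi)

    x = np.asarray(x, dtype=complex)
    return q(x.real) + 1j * q(x.imag)
```

For example, `quantize(R, *FORMATS["R"])` maps a floating-point triangular factor onto the 5.9 grid, so that a floating-point reference and its quantized counterpart can be compared on the same channel realizations.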
In order to measure the LR speed of the CA, Fig. 4 shows the CDF of the number of cycles required per CA LR using both the QR and the SQR decompositions. As shown in Section II, the use of the SQR reduces the number of cycles required per LR compared to using the QR, with a 37% reduction in the average number of cycles. Similar reductions have been reported for the LLL algorithm in [8]. The number of cycles has been obtained targeting the implementation to match the 100 MHz clock frequency limitation of the prototyping board. Different clock frequency/number of cycles trade-offs could be achieved by adjusting the internal pipeline stages of the blocks in the design in order to find the optimal trade-off point. However, those designs could not be tested in real time on the hardware platform.
Finally, the CA performance is compared to that of the LLL implementation presented in [8] in Table III. The results of the LLL implementation are better than those of the CA, primarily for the following two reasons. Firstly, the LLL has been implemented using an HDL, allowing greater flexibility but requiring a longer development time compared to the rapid prototyping methodology used for the CA. Secondly, a number of optimizations have been implemented for the square root and division operations in [8], as opposed to the off-the-shelf blocks used here. This has an effect on both the slice use and the latency of the design. In particular, the latency of the Size Reduction block is 37 cycles, with 29 of them used for the division operation, while the latency of the Givens Rotation block is 88 cycles, with 73 cycles devoted to the square root and division operations. The number of multipliers of the CA could also be reduced to match that of the LLL by reusing the multipliers between the Size Reduction and the Givens Rotation blocks. It should also be noted that the CA has been implemented on a comparatively older hardware platform, which can have a non-negligible effect on the implementation results.
In any case, the complexity results of the CA indicate that its implementation using the methodology and platform of [8] would improve on the results of the LLL algorithm, making the CA a promising LR algorithm for MIMO detection.
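As a quick check on where the cycles go, the latency figures quoted in this section give the share of each block's latency taken by the off-the-shelf division and square root blocks. This is simple arithmetic on the numbers in the text, with the ratios rounded to two decimals:

```latex
\frac{29}{37} \approx 0.78 \quad \text{(Size Reduction)}, \qquad
\frac{73}{88} \approx 0.83 \quad \text{(Givens Rotation)}
```

Roughly four fifths of the latency of each block is therefore spent in those operations. Assuming the design runs at the 100 MHz target, i.e. 10 ns per cycle, the two block latencies correspond to 0.37 µs and 0.88 µs, respectively.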
V. CONCLUSION AND FUTURE WORK
This paper has presented an FPGA implementation of the CA applied to LR-aided detectors using a rapid prototyping approach. It has been shown that the CA should be considered as a low-complexity alternative to existing LR methods. In particular, we believe that a VLSI implementation of the CA would provide a better performance than the existing LLL one. Finally, the optimization of the square root and division operations in the CA and the prototyping of other LR algorithms, like Seysen's Algorithm [5], are the main subjects of ongoing work.
