Abstract-Spatial multiplexing multiple-input multiple-output (MIMO) communication systems have recently drawn significant attention as a means to achieve large gains in wireless system capacity and link reliability. The optimal hard-decision detector for MIMO wireless systems is the maximum-likelihood (ML) detector. ML detection is attractive due to its superior bit error rate (BER) performance. However, the complexity of a direct implementation grows exponentially with the number of antennas and the modulation order, making an ASIC or FPGA implementation infeasible for all but low-order modulation schemes with a small number of antennas. Sphere decoding (SD) solves the ML detection problem in a computationally efficient manner. Even with this complexity reduction, however, real-time implementation on a DSP is generally not feasible, and high-performance parallel computing platforms such as FPGAs are increasingly being employed for this class of applications. The sphere detection problem affords many opportunities for algorithm and micro-architecture optimizations and tradeoffs. This paper provides an overview of techniques to simplify sphere detectors and minimize their FPGA resource utilization for high-performance, low-latency systems.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) systems are known for their capability of achieving high data rates [1] and increasing robustness against fading in wireless channels. However, the complexity of the optimum detector, i.e. the maximum-likelihood (ML) receiver, for MIMO systems grows exponentially with the number of antennas and the modulation order. In order to reduce this complexity, sphere detection [2] and its K-best variation have been proposed [3], analyzed [4] and implemented [5], [6], [7], [8], [9].
MIMO solutions have become more popular in recent years and are becoming an option in several wireless standards. Therefore, it is crucial to study methods that further reduce the complexity of detection while maintaining high BER performance. Conventional K-best MIMO detectors typically require many clock cycles for their sorting steps. For instance, in a multi-stage real-valued K-best detector for a 16-QAM MIMO system, a bubble sorter needs more than 40 cycles if the detector parameter K is set to 10. Sorting such a long list introduces a large delay before the next stage can be processed.
In this paper, we present the FPGA implementation of a configurable MIMO detector that supports 4-, 16- and 64-QAM modulation schemes as well as configurations with 2, 3 and 4 antennas. The detector can switch between these parameters on-the-fly. The breadth-first search employed in our realization presents a large opportunity to exploit the parallelism of the FPGA in order to achieve high data rates. Moreover, the extension of the detector to soft detection and its architectural implications are discussed.
The paper is organized as follows: Section II introduces the system model, and Section III introduces the MIMO detector. The FPGA design and implementation are discussed in Section IV, and the extension to soft detection/decoding is presented in Section V. Finally, the paper is concluded in Section VI.
II. SYSTEM MODEL
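We consider a MIMO system with $M_T$ transmit and $M_R$ receive antennas. The input-output model is captured by

$$\tilde{y} = \tilde{H}\tilde{s} + \tilde{n} \qquad (1)$$

where $\tilde{H}$ is the complex-valued $M_R \times M_T$ channel matrix, $\tilde{s} = [\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_{M_T}]^T$ is the $M_T$-dimensional transmitted vector whose elements are chosen from a complex-valued constellation $\Omega$ of order $w = |\Omega|$, $\tilde{n}$ is the circularly symmetric complex additive white Gaussian noise vector of size $M_R$, and $\tilde{y} = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_{M_R}]^T$ is the $M_R$-element received vector. Each modulation constellation point corresponds to $M_c = \log_2 w$ bits. The preceding MIMO equation can be decomposed into real-valued numbers as follows [8]:

$$\begin{bmatrix} \Re(\tilde{y}) \\ \Im(\tilde{y}) \end{bmatrix} = \begin{bmatrix} \Re(\tilde{H}) & -\Im(\tilde{H}) \\ \Im(\tilde{H}) & \Re(\tilde{H}) \end{bmatrix} \begin{bmatrix} \Re(\tilde{s}) \\ \Im(\tilde{s}) \end{bmatrix} + \begin{bmatrix} \Re(\tilde{n}) \\ \Im(\tilde{n}) \end{bmatrix} \qquad (2)$$

corresponding to

$$y = Hs + n \qquad (3)$$

with $M = 2 M_T$ and $N = 2 M_R$ representing the dimensions of the new system. We call the ordering in (2) the conventional ordering. Using the conventional ordering, all the computations can be performed using only real values. Note that after real-valued decomposition, each $s_i$ in $s$ is chosen from a set of real numbers, $\Omega'$, with $w' = \sqrt{w}$ elements.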
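The conventional real-valued decomposition can be checked numerically with a few lines of code. The following is a minimal sketch in plain Python; the function names and the 2 × 2 example values are illustrative assumptions, not part of the original design.

```python
def real_valued_decomposition(H, y):
    """Stack real and imaginary parts in the conventional ordering.

    H is a list of rows of complex numbers (M_R x M_T), y a list of
    complex numbers (length M_R). Returns the real (2*M_R x 2*M_T)
    matrix and the real vector of length 2*M_R.
    """
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    y_real = [v.real for v in y] + [v.imag for v in y]
    return top + bottom, y_real


def matvec(A, x):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Hypothetical noise-free 2x2 example with 16-QAM symbols.
H_c = [[1 + 2j, 0.5 - 1j], [-0.3 + 0.2j, 2 - 0.5j]]
s_c = [3 - 1j, -1 + 3j]
y_c = [sum(h * s for h, s in zip(row, s_c)) for row in H_c]

H_r, y_r = real_valued_decomposition(H_c, y_c)
# Conventional ordering: all in-phase parts first, then all quadrature parts.
s_r = [s.real for s in s_c] + [s.imag for s in s_c]
```

With these definitions, `matvec(H_r, s_r)` reproduces `y_r`, so the real-valued system carries the same information as the complex one while doubling each dimension.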
III. MIMO DETECTION
The optimum detector for such a system is the maximum-likelihood (ML) detector. ML detection minimizes $\|y - Hs\|^2$ over all possible combinations of the vector $s$. It therefore requires an exhaustive search over an exponentially growing set of candidates, which becomes impractical when a large number of antennas is used. In order to address this challenge, the distance metric is modified [10] as follows:

$$D(s) = \|y - Hs\|^2 = \|Q^H y - Rs\|^2 = \sum_{i=M}^{1} \Big| y'_i - \sum_{j=i}^{M} R_{i,j} s_j \Big|^2 \qquad (4)$$

where $H = QR$ represents the QR decomposition of the channel matrix, $QQ^H = I$, $R$ is an upper triangular matrix, and $y' = Q^H y$. Using the notation of [5], the norm in (4) is computed in an iterative process. Starting with $T_{M+1}(s^{(M+1)}) = 0$, the Partial Euclidean Distance (PED) at each level is given by

$$T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + |e_i(s^{(i)})|^2 \qquad (5)$$

with

$$e_i(s^{(i)}) = y'_i - \sum_{j=i}^{M} R_{i,j} s_j \qquad (6)$$

where $s^{(i)} = [s_i, s_{i+1}, \ldots, s_M]^T$ and $i = M, M-1, \ldots, 1$. This iterative algorithm can be implemented as a tree traversal, with each level of the tree corresponding to one value of $i$ and each node having $w'$ children. The tree traversal can be performed in a breadth-first manner. At each level, only the best $K$ nodes, i.e. the $K$ nodes with the smallest $T_i$, are chosen for expansion. This type of detector is generally known as the K-best detector. Note that such a detector requires sorting a list of size $K \times w'$ to find the best $K$ candidates. For instance, for a 64-QAM system with $K = 16$, this requires sorting a list of size $K \times w' = 16 \times 8 = 128$ at most of the tree levels. This introduces a long delay for the next processing block in the detector unless a highly parallel sorter is used. Highly parallel sorters, on the other hand, consist of a large number of compare-select blocks and result in a dramatic area increase. In order to simplify the sorting step and significantly reduce the delay of the detector, a minimum finder can replace the sorter [6], [11], [12].
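The PED recursion and the K-best pruning described above can be sketched in software as follows. This is a behavioural illustration only, assuming a real upper-triangular $R$ and an already rotated receive vector; the names `level_error`, `k_best` and `ml_brute_force` and the example values are hypothetical, and indices are 0-based rather than the 1-based indices of the text.

```python
from itertools import product

ALPHABET = [-3, -1, 1, 3]  # real-valued alphabet of 16-QAM


def level_error(R, y, i, partial):
    """e_i for 0-based level i, where partial = [s_i, ..., s_{M-1}]."""
    return y[i] - sum(R[i][j] * partial[j - i] for j in range(i, len(R)))


def k_best(R, y, alphabet, K):
    """Breadth-first tree search keeping the K smallest PEDs per level."""
    survivors = [(0.0, [])]  # (PED, partial symbol vector)
    for i in range(len(R) - 1, -1, -1):  # last row of R is detected first
        children = [(ped + level_error(R, y, i, [a] + p) ** 2, [a] + p)
                    for ped, p in survivors for a in alphabet]
        children.sort(key=lambda c: c[0])  # up to K * w' entries to sort
        survivors = children[:K]
    return survivors[0]


def ml_brute_force(R, y, alphabet):
    """Exhaustive ML search, used here only as a reference."""
    M = len(R)

    def dist(s):
        return sum(level_error(R, y, i, list(s[i:])) ** 2 for i in range(M))

    return list(min(product(alphabet, repeat=M), key=dist))


# Hypothetical upper-triangular system with mild noise.
R = [[2.0, 0.3, -0.5, 0.1],
     [0.0, 1.5, 0.4, -0.2],
     [0.0, 0.0, 1.8, 0.6],
     [0.0, 0.0, 0.0, 1.2]]
s_true = [1, -3, 3, -1]
noise = [0.05, -0.10, 0.08, -0.03]
y = [sum(r * s for r, s in zip(row, s_true)) + n for row, n in zip(R, noise)]

ped, s_hat = k_best(R, y, ALPHABET, K=4)
```

The `children.sort` call is the software counterpart of the sorter whose latency and area motivate the minimum-finder approach; setting `K` to the full number of leaves makes the search exhaustive and therefore equivalent to `ml_brute_force`.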
The soft information, typically Log-likelihood Ratio (LLR), passed from the detection block to the decoding block is obtained by
where
This soft information is updated in the decoder and fed back into the detector. Multiple cycles of exchanging soft information between the detector and decoder would eventually lead to more reliable soft information, which will be used by the decoder, in the last iteration, to hard-decode more reliably. Soft information can be generated using a list of possible vector candidates. Once this list is generated, LLR values of Eq. (7) are computed and passed to the decoder [4] :
where $\mathcal{L}$ is the list of possible vectors, $\sigma^2$ is the noise variance, $X_{k,+1}$ is the set of $2^{M_T \cdot M_c - 1}$ bit vectors $x$ with $x_k = +1$, while $X_{k,-1}$ is similarly defined.
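As a rough illustration of how LLRs can be derived from a candidate list, the sketch below assumes the max-log approximation, in which each of the two hypotheses $x_k = +1$ and $x_k = -1$ retains only its best list entry. The function name, the data layout (a list of bit-vector and distance pairs) and the example numbers are assumptions made for illustration and should not be read as the exact expression used in this paper.

```python
def max_log_llr(candidates, prior_llr, sigma2):
    """Max-log LLRs from a candidate list (illustrative assumption).

    candidates: list of (bits, distance) pairs, with bits in {+1, -1}
                and distance = ||y - Hs||^2 for that candidate.
    prior_llr:  a priori LLRs from the decoder, one per bit.
    sigma2:     noise variance.
    """
    llrs = []
    for k in range(len(prior_llr)):
        best = {+1: None, -1: None}
        for bits, dist in candidates:
            # A priori contribution with the k-th bit excluded.
            prior = sum(b * l for j, (b, l) in enumerate(zip(bits, prior_llr))
                        if j != k)
            metric = -dist / sigma2 + prior
            if best[bits[k]] is None or metric > best[bits[k]]:
                best[bits[k]] = metric
        if best[+1] is None or best[-1] is None:
            # One hypothesis never appears in the list; left undefined here.
            llrs.append(None)
        else:
            llrs.append(0.5 * (best[+1] - best[-1]))
    return llrs


# Hypothetical two-bit list with no a priori information.
example = [([+1, +1], 0.2), ([+1, -1], 1.0), ([-1, +1], 2.0)]
llrs = max_log_llr(example, prior_llr=[0.0, 0.0], sigma2=0.5)
```

A short list may contain no candidate for one of the two hypotheses of a given bit; this sketch returns `None` in that case, whereas a practical design would need a defined fallback such as a clipped LLR value.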
IV. FPGA DESIGN OF THE MIMO DETECTOR
The detector is designed for the maximal case, i.e. the largest $M_T \times M_R$ configuration (4 × 4) with 64-QAM, so that it can also support smaller numbers of antennas and lower modulation orders.
Computing the norms in (4) is performed in the PED blocks. Depending on the level of the tree, three different PED blocks are used. The PED block in the first real-valued level, PED$_1$, corresponds to the root node of the tree, $i = M = 2M_T = 8$. The second level consists of $\sqrt{64} = 8$ parallel PED$_2$ blocks, which compute 8 PEDs for each of the 8 PEDs generated by PED$_1$, thus generating 64 PEDs for the $i = 7$ level. Following this level, there are 8 parallel general PED computation blocks, PED$_g$, which compute the closest-node PED for all 8 outputs of each PED$_2$ block. The subsequent levels also use PED$_g$. For any incoming node, PED$_g$ computes and forwards only the best child, whereas both PED$_1$ and PED$_2$ forward all the expanded children. At the end of the very last level, the Min Finder unit detects the signal by finding the minimum of the 64 distances of the appropriate level. The block diagram of this design is shown in Figure 1. The number of detection levels is determined by $M_T$, which is set through the $M_T$ input to the detector and, in turn, configures the Min Finder appropriately. The minimum finder can therefore operate on the outputs of the corresponding level and generate the minimum result. In other words, the multiplexer at each input of the Min Finder block chooses which one of the four streams of data should be fed into the Min Finder. The inputs to the final Min Finder therefore come from level $i = 5$, 3 or 1 if $M_T$ is 2, 3 or 4, respectively; see Figure 1.
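The data flow of this architecture can be mimicked in software. The sketch below is a behavioural model only, not the hardware design: it assumes a real upper-triangular $R$, uses 0-based indices, takes the "best child" in PED$_g$ to be the single constellation point closest to the zero-forcing estimate, and uses hypothetical names (`sort_free_detect`, `output_level`) and example values.

```python
ALPHABET = [-3, -1, 1, 3]  # real-valued alphabet of 16-QAM


def output_level(m_t, max_m_t=4):
    """Tree level (1-based, as in the text) tapped by the Min Finder."""
    return 2 * (max_m_t - m_t) + 1


def sort_free_detect(R, y, alphabet):
    """Full expansion at the first two levels, one child per node after."""
    M = len(R)

    def err(i, partial):  # partial = [s_i, ..., s_{M-1}]
        return y[i] - sum(R[i][j] * partial[j - i] for j in range(i, M))

    # PED_1 and PED_2: every child of every node is expanded and forwarded.
    nodes = [(0.0, [])]
    for i in (M - 1, M - 2):
        nodes = [(ped + err(i, [a] + p) ** 2, [a] + p)
                 for ped, p in nodes for a in alphabet]

    # PED_g: each incoming node forwards only its closest child.
    for i in range(M - 3, -1, -1):
        forwarded = []
        for ped, p in nodes:  # p = [s_{i+1}, ..., s_{M-1}]
            interference = sum(R[i][j] * p[j - i - 1] for j in range(i + 1, M))
            estimate = (y[i] - interference) / R[i][i]
            a = min(alphabet, key=lambda c: abs(c - estimate))
            forwarded.append((ped + err(i, [a] + p) ** 2, [a] + p))
        nodes = forwarded

    # Min Finder: no sorting, only the minimum of the surviving distances.
    return min(nodes, key=lambda n: n[0]), len(nodes)


# Hypothetical upper-triangular system with mild noise.
R = [[2.0, 0.3, -0.5, 0.1],
     [0.0, 1.5, 0.4, -0.2],
     [0.0, 0.0, 1.8, 0.6],
     [0.0, 0.0, 0.0, 1.2]]
s_true = [1, -3, 3, -1]
noise = [0.05, -0.10, 0.08, -0.03]
y = [sum(r * s for r, s in zip(row, s_true)) + n for row, n in zip(R, noise)]

(best_ped, s_hat), n_paths = sort_free_detect(R, y, ALPHABET)
```

Because the number of surviving paths is fixed after the second level (16 for this alphabet, 64 for the 64-QAM design), no list ever needs to be sorted, and only a single minimum search is required at the tapped level.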
The $M_T$ input can change on-the-fly; thus, the design can shift from one mode to another based on the number of streams it is attempting to detect at any time. Moreover, as will be shown later, the configurability of the minimum finder guarantees that detecting a smaller number of streams incurs less latency.
In order to support different modulation orders per data stream, the Flex-Sphere uses another input control signal, $th^{(i)}$, to determine the maximum real value of the modulation order of the $i$-th level. Thus, $th^{(i)} \in \{1, 3, 7\}$. Moreover, since the modulation order of each level can change, a simple comparison-thresholding cannot be used to find the closest candidate for Schnorr-Euchner (SE) [13] ordering. Therefore, the following conversion is used to find the closest SE candidate:
where [.] represents rounding to the nearest integer, and g(.) is
The above procedure is performed in PED$_g$ to ensure that candidates are selected within the proper range. In PED$_1$ and PED$_2$, i.e. the first two levels, the PEDs of the out-of-range candidates are simply overwritten with the maximum value; thus, they are automatically discarded during the minimum-finding procedure.
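One way to picture the range-limited candidate selection is the following sketch. It assumes that the closest SE candidate is the odd integer nearest to the zero-forcing estimate, limited to $[-th^{(i)}, th^{(i)}]$; the paper's exact expression is not reproduced here, and the names `closest_se_candidate` and `mask_out_of_range` as well as the use of infinity as the "maximum value" are illustrative assumptions.

```python
import math

MAX_PED = float("inf")  # stand-in for the saturated maximum PED value


def closest_se_candidate(zf_estimate, th):
    """Nearest odd integer to zf_estimate, limited to [-th, th].

    th is the largest real symbol value of the level: 1, 3 or 7 for
    4-, 16- and 64-QAM, respectively.
    """
    nearest_odd = 2 * math.floor(zf_estimate / 2) + 1
    return max(-th, min(th, nearest_odd))


def mask_out_of_range(peds, alphabet, th):
    """Overwrite PEDs of symbols outside [-th, th], as in the first two levels."""
    return [p if abs(a) <= th else MAX_PED for p, a in zip(peds, alphabet)]


# The same estimate maps to different symbols for different modulation orders.
picks = [closest_se_candidate(5.3, th) for th in (1, 3, 7)]

# Full 64-QAM alphabet, but the level is configured for 16-QAM (th = 3).
alphabet = [-7, -5, -3, -1, 1, 3, 5, 7]
masked = mask_out_of_range([0.5] * 8, alphabet, th=3)
```

The masking approach lets the fully expanded levels keep a fixed 8-way structure regardless of the modulation order, since masked entries can never win the minimum search.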
As for the real-valued decomposition, we use the modified real-valued decomposition (M-RVD) ordering of [11], [12]. In M-RVD, unlike the conventional ordering, each quadrature component is followed by the in-phase component of the same antenna. In other words, with M-RVD every antenna is isolated from the other antennas in two consecutive levels of the tree. Consequently, with the conventional real-valued decomposition, the results for a 2 × 2 system would be ready only after going through all the in-phase tree levels and the first two quadrature levels, whereas with M-RVD there is no need to incur the latency of the unnecessary levels. Thus, the M-RVD technique offers a latency reduction compared to the conventional real-valued decomposition.
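The reordering can be viewed as a column permutation of the conventionally ordered real-valued system. The sketch below assumes that the conventional vector stacks all in-phase components followed by all quadrature components, and that the M-RVD vector interleaves them per antenna so that a traversal from the last entry to the first visits the quadrature and then the in-phase component of each antenna. The function names and example values are hypothetical.

```python
def mrvd_permutation(m_t):
    """Indices into the conventional vector [Re s_1..Re s_MT, Im s_1..Im s_MT]
    that produce the interleaved vector [Re s_1, Im s_1, ..., Re s_MT, Im s_MT]."""
    perm = []
    for antenna in range(m_t):
        perm += [antenna, m_t + antenna]
    return perm


def permute_columns(H, perm):
    """Reorder the columns of a real-valued channel matrix."""
    return [[row[p] for p in perm] for row in H]


def matvec(A, x):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Hypothetical conventionally ordered system for M_T = 2.
H_conv = [[1.0, 0.5, -2.0, 1.0],
          [-0.3, 2.0, -0.2, 0.5],
          [2.0, -1.0, 1.0, 0.5],
          [0.2, -0.5, -0.3, 2.0]]
s_conv = [3, -1, -1, 3]  # [Re s_1, Re s_2, Im s_1, Im s_2]

perm = mrvd_permutation(2)
H_mrvd = permute_columns(H_conv, perm)
s_mrvd = [s_conv[p] for p in perm]  # [Re s_1, Im s_1, Re s_2, Im s_2]
```

Permuting the columns and the symbol vector together leaves the received vector unchanged, so the reordering affects only the order in which the tree levels are visited, not the detection problem itself.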
A. FPGA Synthesis Results
The System Generator FPGA implementation results of the MIMO detector on a Xilinx Virtex-5 FPGA, xc5vsx95t-3ff1136, for 16-bit precision and $M_T = 4$ are presented in Table I. The maximum achievable clock frequency is 285.71 MHz. The folding factor of the design is $F = 8$; thus, the maximum achievable data rate is
for $M_T = 4$ and $w_i = 64$.
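The arithmetic behind such a rate figure can be sketched as follows, under the assumption that one complete symbol vector, carrying $M_T \cdot \log_2 w_i$ bits, is detected every $F$ clock cycles. The function name and this throughput model are illustrative assumptions rather than a restatement of the paper's equation.

```python
import math


def max_data_rate_mbps(m_t, w, f_max_mhz, folding):
    """Bits per detected vector times vectors per microsecond.

    Assumes one vector of m_t * log2(w) bits every `folding` clock cycles.
    """
    return m_t * math.log2(w) * f_max_mhz / folding


# Values quoted in the text: M_T = 4, 64-QAM, 285.71 MHz, F = 8.
rate = max_data_rate_mbps(4, 64, 285.71, 8)  # about 857.13 Mbps under this assumption
```

Under the same assumption, the rate scales linearly with the number of streams and with the number of bits per symbol, so smaller configurations yield proportionally lower peak rates at the same clock frequency.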
B. Simulation Results
In this section, we present the simulation results for the Flex-Sphere and compare the performance of the FPGA fixed-point implementation with that of the optimum floating-point maximum-likelihood (ML) detector. Prior to the M-RVD, we employ the channel ordering of [14] to further close the gap to ML. We also assume that all the streams use the same modulation scheme. We assume complex-valued channel matrices, with the real and imaginary parts of each element drawn from the normal distribution.
In order to ensure that all the antennas in the receiver have a similar average received SNR, and that none of the users' messages is suppressed by other messages, a power control scheme is employed. Figure 2 shows the simulation results for the maximal 4 × 4 configuration. As can be seen, the proposed hardware implementation performs within, at most, 1 dB of the optimum maximum-likelihood detection.
V. SOFT DETECTION/DECODING
The list of candidates generated at the last level of the MIMO detector can be used to generate soft values, i.e. LLRs, using Eq. (8). Those LLRs are then used by the channel decoder to decode the information bits. Figure 3 provides a schematic representation of Eq. (8). The inputs to the computation are the length-$M_T M_c$ vector of bit-level APP probabilities computed by the outer channel decoder; a list of $P$ candidate output vectors from the MIMO sphere detector, each a bit vector of length $M_T M_c$; and a $P$-vector of distance metrics, or costs, one for each of the $P$ candidates in the sphere detector's output symbol list.
To determine the cost, initially in terms of time, of computing the soft outputs from the list of candidates generated by the sphere detector, first consider the number of clock cycles required to compute Eq. (12) for a single candidate using a sequential approach. Since both $x_{[k]}$ and $L_{A,[k]}$ exclude the $k$-th bit of the hard-decision bit vectors in the list of candidates generated by the sphere detector, and since each entry of $x_{[k]}$ takes on only the values ±1, the inner product
clock cycles using only a single adder. One further addition is required to form the sum
. This component of the calculation is completed by taking a Jacobian logarithm. All of the candidates in the list need to be processed, and assuming that there are K · |Λ| such candidates, where |Λ| denotes the cardinality of the constellation, results in the time required to compute the soft value for a single bit in the length
where $T_{jacln}$ is the time to compute a Jacobian logarithm. The difference between the two primary terms in Eq. (8), corresponding to $x \in X_{k,+1}$ and $x \in X_{k,-1}$, requires one subtraction, and there are $M_T \cdot M_c$ such calculations. Combining these costs gives the final workload $T_1$ for computing the soft value for a single bit as
The hard-decision bit vector contains $M_T \cdot M_c$ entries, for each of which a soft value needs to be computed, giving the total time $T_{soft}$ for computing the soft outputs for all of the bits as
Scaling by the noise variance term $-1/\sigma^2$ in Eq. (8) can be handled as a pre-processing phase before computing the soft outputs. That is, prior to engaging the soft-output generation circuit, the length-$K \cdot |\Lambda|$ list of cost metrics is scaled by $-1/\sigma^2$. The cost of the scaling by 1/2 in Eq. (8) is also not included in the calculations, as it is realized in hardware as a simple bit shift that is accommodated in the circuit wiring and does not incur any compute fabric cost in an FPGA.
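The two building blocks mentioned in this section, the Jacobian logarithm and the pre-scaling of the cost list, can be sketched in a few lines. The Jacobian logarithm identity used here, $\ln(e^a + e^b) = \max(a, b) + \ln(1 + e^{-|a-b|})$, is standard; the function names and the sequential accumulation order are illustrative assumptions and do not describe the circuit itself.

```python
import math


def jacln(a, b):
    """Jacobian logarithm: ln(exp(a) + exp(b)) computed in a stable form."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def jacln_list(values):
    """Sequentially accumulate the Jacobian logarithm over a list."""
    acc = values[0]
    for v in values[1:]:
        acc = jacln(acc, v)
    return acc


def prescale_costs(costs, sigma2):
    """Pre-processing step: scale every distance metric by -1/sigma^2."""
    return [-c / sigma2 for c in costs]


# Hypothetical list of three candidate costs.
scaled = prescale_costs([0.2, 1.0, 2.0], sigma2=0.5)
total = jacln_list(scaled)
```

Accumulating sequentially means that a list of $P$ entries costs $P - 1$ Jacobian logarithm evaluations in this sketch, which is the kind of per-candidate cost that the timing analysis above accounts for.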
