Spatial division multiplexing (SDM) in MIMO technology significantly increases the spectral efficiency, and hence the capacity, of a wireless communication system; it is a core component of next-generation wireless systems, e.g. WiMAX, 3GPP LTE and other OFDM-based communication schemes. Moreover, spatial division multiple access (SDMA) is one of the widely used techniques for sharing the wireless medium between different mobile devices. Sphere detection is a prominent method of reducing the detection complexity in both SDM and SDMA systems while maintaining BER performance comparable with that of optimum maximum-likelihood (ML) detection. On the other hand, with different standards supporting different system parameters, it is crucial for both base station and handset devices to be configurable and to switch seamlessly between different modes without the need for separate dedicated hardware units. This challenge emphasizes the need for SDR designs that target handset devices. In this paper, we propose the architecture and FPGA realization of a configurable sort-free sphere detector, Flex-Sphere, that supports 4-, 16- and 64-QAM modulations as well as a combination of 2, 3 and 4 antenna/user configurations for handsets. The detector provides a data rate of up to 857.1 Mbps, which fits well within the requirements of the next-generation wireless standards. The algorithmic optimizations employed to produce an FPGA-friendly realization are discussed.
Introduction
Multiple-input multiple-output (MIMO) communication systems and spatial division multiplexing (SDM) have recently drawn significant attention as a means to achieve tremendous gains in system capacity and link reliability. Moreover, spatial division multiple access (SDMA) has recently received attention for its promise to increase the sum data rate of different users in wireless networks and to create a virtual MIMO system between multiple users and a base station.
The optimal hard-decision detector, in terms of BER performance, for all MIMO wireless systems is the maximum likelihood (ML) detector. However, the complexity of a direct implementation of ML grows exponentially with the number of antennas and the modulation order, making its ASIC or FPGA implementation infeasible for all but low-order modulation schemes using a small number of antennas. Sphere detection [1], and its K-best variation, have been proposed [2], analyzed [3] and implemented [4] [5] [6] [7] [8] [9] as lower-complexity alternatives.
As MIMO solutions become more popular and are incorporated into different wireless standards, such as IEEE 802.11n, IEEE 802.16e and upcoming 3GPP LTE, it is crucial to investigate methods to further reduce the complexity of detection while maintaining high BER performance. Conventional K-best MIMO detectors typically require long delay cycles for sorting steps. For instance, for a multi-stage real-valued based K-best detector for a 16-QAM MIMO system, a bubble sorter needs more than 40 cycles if the detector parameter, K, is set to 10. This long list size introduces a large delay for the processing of the next stage. Moreover, in order to achieve higher reliability, it is important to come up with a cost-free ordering scheme that would lead to a further error performance improvement of the system.
The wide range of developing wireless standards requires handsets to support a wide variety of schemes with their limited available resources. Most upcoming standards, for example, require the handsets to support one to four antennas as well as QPSK, 16-QAM and 64-QAM modulations. This is a challenging task given that the handsets need to be able to work with different standards and protocols. Therefore, given the area and power constraints of SDR handset devices, it is crucial that they are designed on a common hardware platform and utilize their real-time reconfiguration capability to communicate with heterogeneous networks. This paper presents the architecture and the FPGA implementation of a configurable sphere detector called Flex-Sphere. Flex-Sphere supports three commonly used modulation schemes, 4-, 16- and 64-QAM, as well as a combination of 2, 3 and 4 antenna and/or user configurations, and can switch between all these parameters in real time. These parameters are commonly used in the current and upcoming wireless standards, such as IEEE 802.16 and 3GPP LTE, and thus were chosen as the implementation guidelines. However, it should be noted that the proposed architecture can be readily extended to higher-order systems. The detector provides a data rate of up to 849.9 Mbps. The breadth-first search employed in our realization presents a large opportunity to exploit the parallelism of the FPGA in order to achieve high data rates. Algorithmic modifications to address potential sequential bottlenecks in the traditional breadth-first search-based SD are highlighted in the paper.
The initial results of this work were presented in [10, 11]; the simulation results for the hardware implementation of the full 4 × 4 system, as well as FPGA synthesis results on the WARP platform, are added in this work. The rest of the paper is organized as follows: Section 2 presents the general system model. The proposed FPGA-friendly architecture for the SDR MIMO detector is presented in Section 3. The complexity issues and comparisons are discussed in Section 4. Section 5 introduces the model-based design of the configurable Flex-Sphere for the SDR handset. The simulation results for floating point and FPGA fixed point of the system for different parameters are given in Section 6. Finally, the paper concludes with Section 7.
System Model
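We assume a virtual MIMO system with $n$ transmitters, each with $L_r$, $r = 1, ..., n$, antennas such that $M_T = \sum_{r=1}^{n} L_r$, and a receiver, e.g. a base station, with $M_R \geq M_T$ receive antennas. All the transmitters use the same channel to communicate simultaneously with the receiver. The input-output model is captured by

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} \qquad (1)$$

where $\mathbf{H}$ is the complex-valued $M_R \times M_T$ channel matrix, $\mathbf{s} = [s_1, s_2, ..., s_{M_T}]^T$ is the transmitted vector from the $n$ transmitters, where each $s_j$, $j = 1, ..., M_T$, is chosen from a complex-valued constellation $\Omega_j$ of order $w_j = |\Omega_j|$, $\mathbf{n}$ is the circularly symmetric complex additive white Gaussian noise vector of size $M_R$, and $\mathbf{y} = [y_1, y_2, ..., y_{M_R}]^T$ is the $M_R$-element received vector. Note that we do not restrict all the $M_T$ parallel streams to use the same modulation order; rather, each stream, which corresponds to one of the antennas of one of the users, may use 4-, 16- or 64-QAM modulation. The preceding MIMO equation can be decomposed into real-valued numbers as follows [12]:

$$\begin{bmatrix} \Re(\mathbf{y}) \\ \Im(\mathbf{y}) \end{bmatrix} = \begin{bmatrix} \Re(\mathbf{H}) & -\Im(\mathbf{H}) \\ \Im(\mathbf{H}) & \Re(\mathbf{H}) \end{bmatrix} \begin{bmatrix} \Re(\mathbf{s}) \\ \Im(\mathbf{s}) \end{bmatrix} + \begin{bmatrix} \Re(\mathbf{n}) \\ \Im(\mathbf{n}) \end{bmatrix} \qquad (2)$$

corresponding to

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} \qquad (3)$$

in which $\mathbf{y}$, $\mathbf{H}$, $\mathbf{s}$ and $\mathbf{n}$ now denote the real-valued quantities, with $M = 2M_T$ and $N = 2M_R$ representing the dimensions of the new model. We call the ordering in Eq. 2 the conventional ordering. Using the conventional ordering, all the computations can be performed in real values, which simplifies the implementation. Note that after real-valued decomposition, each $s_i$, $i = 1, ..., M$, in $\mathbf{s}$ is chosen from a set of real numbers, $\Omega'_i$, with $w'_i = \sqrt{w_i}$ elements. For instance, for a 64-QAM modulation, each $s_i$ can take any of the values in the set $\Omega' = \{\pm 7, \pm 5, \pm 3, \pm 1\}$.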
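As a minimal sketch of the conventional real-valued decomposition, the following pure-Python snippet stacks real parts over imaginary parts and checks, in the noise-free case, that the real-valued model reproduces the complex one. The function names and the 2 × 2 example values are illustrative assumptions, not taken from the paper.

```python
def real_valued_decomposition(H, y):
    """Conventional ordering: [Re; Im] stacking of a complex model."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    y_real = [v.real for v in y] + [v.imag for v in y]
    return top + bottom, y_real


def stack_symbols(s):
    """Real parts of all symbols first, then all imaginary parts."""
    return [v.real for v in s] + [v.imag for v in s]


def matvec(A, x):
    """Plain matrix-vector product on lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Hypothetical 2 x 2 example (M_T = M_R = 2), noise-free.
H = [[1 + 2j, 0.5 - 1j], [-0.3 + 0.7j, 2 - 0.2j]]
s = [3 - 1j, -1 + 3j]            # two 16-QAM symbols
y = matvec(H, s)

H_r, y_r = real_valued_decomposition(H, y)   # 4 x 4 real matrix, length-4 vector
s_r = stack_symbols(s)                        # M = 2 * M_T = 4 real symbols
```

Each entry of `s_r` is one of the real values ±1, ±3, which is what lets the detector work with real arithmetic only.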
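The general optimum detector for such a system is the maximum-likelihood (ML) detector, which minimizes $\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2$ over all the possible combinations of the $\mathbf{s}$ vector. Notice that for high-order modulations and a large number of antennas, this detection scheme incurs an exhaustive, exponentially growing search among all the candidates, and is not practically feasible in a MIMO receiver. However, it is shown that using the QR decomposition of the channel matrix, the distance norm can be simplified [13] as follows:

$$D(\mathbf{s}) = \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 = \|\mathbf{Q}^H\mathbf{y} - \mathbf{R}\mathbf{s}\|^2 = \sum_{i=M}^{1} \Big| y'_i - \sum_{j=i}^{M} R_{ij} s_j \Big|^2 \qquad (4)$$

where $\mathbf{H} = \mathbf{Q}\mathbf{R}$, $\mathbf{Q}\mathbf{Q}^H = \mathbf{I}$ and $\mathbf{y}' = \mathbf{Q}^H \mathbf{y}$. Note that the transition in Eq. 4 is possible because $\mathbf{R}$ is an upper triangular matrix. The norm in Eq. 4 can be computed in $M$ iterations starting with $i = M$. When $i = M$, i.e. the first iteration, the initial partial norm is set to zero, $T_{M+1}(\mathbf{s}^{(M+1)}) = 0$.

Using the notation of [4], at each iteration the Partial Euclidean Distances (PEDs) at the next levels are given by

$$T_i(\mathbf{s}^{(i)}) = T_{i+1}(\mathbf{s}^{(i+1)}) + |e_i(\mathbf{s}^{(i)})|^2 \qquad (5)$$

with $\mathbf{s}^{(i)} = [s_i, s_{i+1}, ..., s_M]^T$ and

$$|e_i(\mathbf{s}^{(i)})|^2 = \Big| y'_i - R_{ii} s_i - \sum_{j=i+1}^{M} R_{ij} s_j \Big|^2 \qquad (6)$$

$$= \big| b_{i+1}(\mathbf{s}^{(i+1)}) - R_{ii} s_i \big|^2 \qquad (7)$$

where $b_{i+1}(\mathbf{s}^{(i+1)}) = y'_i - \sum_{j=i+1}^{M} R_{ij} s_j$.

One can envision this iterative algorithm as a tree traversal with each level of the tree corresponding to one $i$ value, and each node having $w'_i$ children.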
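The level-by-level accumulation can be sketched as follows. This is only an illustration, assuming the usual recursion of [4] in which the PED of level $i$ equals the PED of level $i+1$ plus $|b_{i+1} - R_{ii}s_i|^2$, with $b_{i+1} = y'_i - \sum_{j>i} R_{ij}s_j$. The function name and the 3 × 3 example values are hypothetical.

```python
def partial_distances(R, y_rot, s):
    """PEDs accumulated from the last level down to the first.

    R     : upper-triangular M x M list of lists
    y_rot : rotated received vector y' = Q^H y
    s     : candidate real-valued symbol vector
    Returns a list whose entry i is the PED after processing levels i..M-1
    (0-based), so entry 0 is the full distance.
    """
    M = len(s)
    peds = [0.0] * M
    total = 0.0                      # corresponds to T_{M+1} = 0
    for i in range(M - 1, -1, -1):   # start at the last row of R
        b = y_rot[i] - sum(R[i][j] * s[j] for j in range(i + 1, M))
        total += (b - R[i][i] * s[i]) ** 2
        peds[i] = total
    return peds


# Hypothetical 3 x 3 example.
R = [[2.0, 1.0, -1.0],
     [0.0, 1.5, 0.5],
     [0.0, 0.0, 1.0]]
y_rot = [0.5, 2.0, 2.5]
s = [1, -1, 3]
peds = partial_distances(R, y_rot, s)
```

Because the PED never decreases as more levels are added, a node with a large partial distance cannot lead to the closest leaf, which is what makes pruning the tree meaningful.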
The tree traversal can be performed in a breadth-first manner. At each level, only the best K nodes, i.e. the K nodes with the smallest $T_i$, are chosen for expansion. This type of detector is generally known as the K-best detector. Note that such a detector requires sorting a list of size $K \times w'$, where $w'$ is the number of real-valued constellation points per level, to find the best K candidates. For instance, for a 16-QAM system with K = 10, this requires sorting a list of size $K \times w' = 10 \times 4 = 40$ at most of the tree levels. This introduces a long delay for the next processing block in the detector unless a highly parallel sorter is used. Highly parallel sorters, on the other hand, consist of a large number of compare-select blocks and result in a dramatic area increase.
Flex-Sphere SDM/SDMA Detector
In order to simplify the sorting step, which significantly reduces the delay of the detector, we propose a novel MIMO detector. This detector is based on a sort-free strategy, and utilizes a new modified real-valued decomposition ordering (M-RVD) scheme.
Tree Traversal for Flex-Sphere Detection
In order to address the sorting challenge, we propose using a sort-free detector. With this technique, the long sorting operation is effectively simplified to a minimum-finding operation. The detailed steps of this algorithm are described below:
An example of this algorithm is illustrated in Fig. 1 for a virtual 4 × 4, 64-QAM system. Note that, as described above, the first two levels are fully expanded to guarantee high performance, whereas for the following levels, only the best candidate in the children list of a parent node is expanded. In other words, after passing the first two levels, $w_{M_T}$ nodes are expanded, and for each of those $w_{M_T}$ nodes, the best child node among its children is selected as the surviving node. Therefore, the new node list contains $w_{M_T}$ nodes in the third level. These $w_{M_T}$ nodes are expanded in a similar way to the fourth level, and this procedure continues until the very last level, where the minimum-distance node is taken as the detected node. Moreover, from the Schnorr-Euchner (SE) ordering [14], we know that finding the child $s_i$ that minimizes $|b_{i+1}(\mathbf{s}^{(i+1)}) - R_{ii}s_i|^2$ basically corresponds to finding the real-valued constellation point closest to $\frac{1}{R_{ii}} b_{i+1}(\mathbf{s}^{(i+1)})$; see Eq. 7. Thus, the long sorting of K-best is avoided.
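The traversal described above can be sketched in a few lines of Python. This is a simplified illustration rather than the hardware algorithm. It assumes that all levels share one real-valued constellation, that $\mathbf{R}$ has a non-zero diagonal, and that squared distances are used instead of any norm approximation. The names `flex_sphere_detect` and `nearest_point` and the 4 × 4 real-valued example are hypothetical.

```python
def nearest_point(x, levels):
    """Constellation point closest to x (the Schnorr-Euchner first choice)."""
    return min(levels, key=lambda c: abs(x - c))


def flex_sphere_detect(R, y_rot, levels):
    """Sort-free breadth-first search on an upper-triangular real system.

    R      : M x M upper-triangular list of lists with non-zero diagonal
    y_rot  : rotated received vector (Q^H y), length M
    levels : real-valued constellation points, e.g. [-3, -1, 1, 3]
    """
    M = len(y_rot)
    survivors = [([], 0.0)]  # (symbols of the levels decided so far, PED)
    for depth, i in enumerate(range(M - 1, -1, -1)):
        expanded = []
        for path, dist in survivors:
            # Residual after removing the already-decided symbols.
            b = y_rot[i] - sum(R[i][j] * path[j - i - 1] for j in range(i + 1, M))
            if depth < 2:
                children = levels  # first two levels: full expansion
            else:
                children = [nearest_point(b / R[i][i], levels)]  # best child only
            for c in children:
                expanded.append(([c] + path, dist + (b - R[i][i] * c) ** 2))
        survivors = expanded
    # Single minimum search at the last level; no sorting anywhere.
    best_path, _ = min(survivors, key=lambda node: node[1])
    return best_path


LEVELS = [-3, -1, 1, 3]  # one real dimension of 16-QAM
R = [[2.0, 0.5, -1.0, 0.25],
     [0.0, 1.5, 0.5, -0.5],
     [0.0, 0.0, 1.0, 0.75],
     [0.0, 0.0, 0.0, 1.25]]
s_true = [1, -3, 3, -1]
y_rot = [sum(r * s for r, s in zip(row, s_true)) for row in R]  # noise-free
detected = flex_sphere_detect(R, y_rot, LEVELS)
```

After the second level the survivor list stays at a fixed size (16 in this example), and each survivor contributes exactly one child per level, so only the final level needs a minimum search.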
Modified Real-Valued Decomposition (M-RVD) Ordering
For the sort-free detector described in the preceding section, we propose using a novel modified real-valued decomposition (M-RVD) ordering, which improves the BER performance compared to the ordering given in Eq. 2.

The new decomposition is summarized as:

$$\tilde{\mathbf{y}} = \tilde{\mathbf{H}}\tilde{\mathbf{s}} + \tilde{\mathbf{n}}, \qquad \tilde{\mathbf{s}} = [\Re(s_1)\ \Im(s_1)\ \Re(s_2)\ \Im(s_2)\ \cdots\ \Re(s_{M_T})\ \Im(s_{M_T})]^T \qquad (8)$$

where $\tilde{\mathbf{H}}$ is the permuted channel matrix of Eq. 3 whose columns are reordered to match the other vectors of the new decomposition ordering in Eq. 8. It is worth noting that since the difference between RVD and M-RVD is only the grouping of the signals, there is no extra computational cost associated with this novel ordering.
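Since M-RVD only regroups the real-valued unknowns, it can be illustrated as an index permutation. The sketch below assumes the in-phase part precedes the quadrature part within each pair, which is an assumption about the exact ordering. The helper names are hypothetical, and the integer matrix is an arbitrary placeholder rather than a real channel.

```python
def mrvd_permutation(num_streams):
    """Map [Re s_1..Re s_MT, Im s_1..Im s_MT] to [Re s_1, Im s_1, Re s_2, ...]."""
    return [k // 2 + (k % 2) * num_streams for k in range(2 * num_streams)]


def permute_vector(v, perm):
    return [v[p] for p in perm]


def permute_columns(H_real, perm):
    return [[row[p] for p in perm] for row in H_real]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


perm = mrvd_permutation(2)                      # [0, 2, 1, 3]
s_conventional = [1, -3, 3, -1]                 # Re s_1, Re s_2, Im s_1, Im s_2
s_mrvd = permute_vector(s_conventional, perm)   # Re s_1, Im s_1, Re s_2, Im s_2

# Arbitrary placeholder matrix, used only to show that reordering columns
# together with the symbols leaves the product unchanged.
H_real = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
H_mrvd = permute_columns(H_real, perm)
```

The product `H_mrvd @ s_mrvd` equals `H_real @ s_conventional`, so the reordering adds no arithmetic and only changes which unknowns the QR decomposition places at the top of the tree.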
Note that with the modified real-valued decomposition (M-RVD) ordering, the first two levels correspond to the in-phase and quadrature parts of the same complex symbol, whereas in the conventional real-valued decomposition scenario, the first two levels of the tree correspond to the quadrature parts of two different complex symbols. A careful look at the tree traversal scheme of the preceding section shows that since the first two levels of the tree are fully expanded, the error performance of the scheme heavily depends on the third level of the tree. Therefore, rather than using the magnitude of $R_{M,M}$ as a metric to choose the decomposition ordering scheme, which justifies the conventional real-valued decomposition (RVD) [15], we need to look at the behavior of the third lowest diagonal element of the $\mathbf{R}$ matrix. As demonstrated in Fig. 2, there is an increase in the magnitude of $R_{M-2,M-2}$ when using M-RVD; hence, M-RVD is a better choice than the conventional RVD. The impact of M-RVD on the BER performance is discussed in the next sections.
Complexity Comparison
In order to compare the complexity of the proposed MIMO detector, described in the preceding section, versus the conventional K-best technique, we consider the number of operations, the relative latency reduction, and the architecture advantages of the proposed detector.
Number of Operations
In this section, we compute the number of operations required to complete the detection process. Since the channel matrix typically changes at a much slower rate than the received signal vector, we make the assumption that simple channel matrix operations, e.g. $R_{ij}s_j$ computations, are performed in a separate preprocessing unit. Note that this simply involves shift-add operations, with $s_j$ drawn from its real-valued constellation set $\Omega'_j$. Also, as suggested in [4], we make the assumption that all the PED norms are approximated by $\ell^1$-norms to avoid the squarers and multipliers. Therefore, the only major high-rate detector operations are compare-select, for either sorting or minimum-finding, addition and multiplication.
Note that in order to achieve minimum latency, we make the assumption that both detectors use cascaded minimum finders to sort a list. Therefore, in order to find the best K elements of a list of size $l$, K cascaded minimum finders are required. So, the number of operations required to sort the best K candidates of a list of size $l$, denoted by $f_K(l)$ in Table 1, is given by

$$f_K(l) = \sum_{i=1}^{K} (l - i),$$

since the $i$-th cascaded minimum finder operates on the $l - i + 1$ remaining elements and needs $l - i$ compare-select operations.
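A small sketch of this count follows. It assumes that the $i$-th cascaded minimum finder scans the $l - i + 1$ elements left over by its predecessors and therefore needs $l - i$ compare-select operations. The function name is hypothetical.

```python
def cascaded_min_ops(list_size, k):
    """Compare-select count for extracting the k smallest of list_size items
    with k cascaded minimum finders (assumed: stage i costs list_size - i)."""
    if not 0 <= k <= list_size:
        raise ValueError("k must lie between 0 and list_size")
    return sum(list_size - i for i in range(1, k + 1))


k_best_ops = cascaded_min_ops(40, 10)   # K = 10, list of 10 x 4 candidates
single_min_ops = cascaded_min_ops(4, 1) # one minimum among 4 candidates
```

Under these assumptions the K-best list of size 40 needs several hundred compare-selects per level, whereas a single minimum search over four candidates needs three.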
Given the above assumptions, the total number of operations for the K-best scenario and the proposed Flex-Sphere scheme are given in Table 1.

Table 1: Comparison of the latency and the operation counts between the conventional K-best and the proposed Flex-Sphere detector. [Columns: K-best; Flex-Sphere detector; Example]

[...] approach will yield the total number of additions required for the Flex-Sphere detection.

3. Multiplication: The Flex-Sphere uses $\ell^1$-norms and thus does not need to use the FPGA multipliers, whereas the K-best scheme needs to compute $w'$ $\ell^2$-norms in the first level, $w$ norms in the second level, and $Kw'$ norms in the remaining $(M - 2)$ levels.
In order to compute the final operation count, comparators are assumed to have unit complexity, and adders to have twice the complexity of comparators. Multipliers are needed to implement the squarers, and for the wordlengths that we are interested in, i.e. 16 bits, they can be assumed to be ten times more complex than adders. It is worth noting that other relative complexity coefficients would yield similar general results. Based on these relative complexities, the number of operations is plotted for different numbers of antennas in Fig. 3. The operation count increases with K because a higher K means a higher number of visited nodes per level and therefore more computation. Note that except for small K values, the computation overhead of the conventional K-best scheme is considerably higher than that of the proposed Flex-Sphere scheme. More details on the BER performance comparisons will be presented in Section 4.4.
Latency
High latency decreases the data rate in feedback based receivers. For instance, for iterative detector/decoder structures, where the detector uses the feedback data from the decoder to improve the detection performance, higher detection/decoding latency reduces the data rate significantly. A similar argument applies to the overall receiver throughput when the interaction between the physical layer and MAC layer takes more cycles due to the higher physical layer latency. We compare the latency overhead of our proposed detector versus the conventional K-best detector, and show that the Flex-Sphere technique introduces significant latency reduction.
Note that if the detectors are fully parallelized to enhance data rates, the conventional K-best detector requires K successive minimum finders. The first minimum finder needs to find the minimum among $Kw'$ candidates, where $w'$ is the real-valued constellation size, and therefore has a latency of $Kw' - 1$. The second one needs to find the minimum among $Kw' - 1$ candidates and therefore has a latency of $Kw' - 2$, and so on. The proposed Flex-Sphere detector, however, requires only one level of minimum finding, as it only needs to find the minimum, i.e. sorting with K = 1. Thus, if we assume full parallelism for both types of detectors, the latency of the sorter that connects one of the middle levels of the tree to the next level is given in Table 1.
Notice the significant latency reduction that the proposed Flex-Sphere detector promises for the sorting after each level. Also, note that Table 1 represents only the latency of one level; thus, for a 4 × 4 system, there would be $M - 3 = 2M_T - 3 = 5$ such sorters; see Table 1.
Architecture
The common K-best sorting requires a bubble-sort architecture [8]. In this architecture, all the nodes need to be passed into the sorter sequentially, and the processing of the next level of the tree cannot start until all the K × w nodes have passed through the sequential sorter. Even semi-parallel sorters still require a large area and many cycles to finish the detection process; see Table 1 and Fig. 3. With the Flex-Sphere technique, all the long sorting operations are avoided. Moreover, the Flex-Sphere technique is amenable to parallelization with less overhead than the K-best technique.
Simulation Results
For the BER simulations, the Rayleigh fading channel model is assumed, and the channel matrix is independent for each new transmission. The BER results of 4 × 4 and 3 × 3 systems are compared for a 16-QAM modulation scheme. Note that in order to conduct a fair performance comparison, the K values are chosen such that the K-best technique requires a number of operations similar to that of the proposed Flex-Sphere scheme; see Fig. 3. Therefore, based on the results shown in Fig. 3 and Table 1, K is set to 5 and 4 for the 4 × 4 and 3 × 3 systems, respectively.
The BER simulation results of Fig. 4 suggest that the proposed Flex-Sphere scheme can improve the BER performance by more than 5 dB compared to the conventional K-best technique in higher SNR regimes. Note that it was shown in the preceding sections that for a 4 × 4 case, the K = 5 scheme requires computational complexity similar to that of the Flex-Sphere scheme, and it requires 12 times more latency for sorting in each level compared to the proposed sort-free scheme. A similar argument holds for a 3 × 3 system when
It is also worth noting that in both cases, the M-RVD ordering plays an important role in improving the performance.
FPGA Design of the Configurable Detector for SDR Handsets
In this section, the main features of the architecture and the FPGA implementation of the SDR handset detector are presented. We use Xilinx System Generator [16] to implement the proposed architecture. In order to support all the different numbers of antennas/users and modulation orders, the detector is designed for the maximal case, i.e. the $M_T \times M_R$, 64-QAM case, and configurability elements are introduced in the design to support different configurations.
PED Computations
Computing the norms in Eq. 7 is performed in the PED blocks. Depending on the level of the tree, three different PED blocks are used. The PED block in the first real-valued level, $\mathrm{PED}_1$, corresponds to the root node in the tree, $i = M = 2M_T = 8$. The second level consists of $\sqrt{64} = 8$ parallel $\mathrm{PED}_2$ blocks, which compute 8 PEDs for each of the 8 PEDs generated by $\mathrm{PED}_1$, thus generating 64 PEDs for the $i = 7$ level. Following this level, there are 8 parallel general PED computation blocks, $\mathrm{PED}_g$, which compute the closest-node PED for all 8 outputs of each of the $\mathrm{PED}_2$ blocks. The next levels also use $\mathrm{PED}_g$. At the end, the Min_Finder unit detects the signal by finding the minimum of the 64 distances of the appropriate level. The block diagram of this design is shown in Fig. 5.
Configurable Design
In order to ensure the configurability of the Flex-Sphere, it needs to support different M T as well as different modulation orders for different users. The configurability of the detector is achieved through two input signals, M T and q (i) , which control the number of antennas and the modulation order, respectively. These two inputs can change based on the system parameters at any time during the detection procedure. Therefore, this configurability is a real-time operation.
Number of Antennas
$M_T$ determines the number of detection levels. It is set through the $M_T$ input to the detector, which in turn configures the Min_Finder appropriately, so that the minimum finder operates on the outputs of the corresponding level and generates the minimum result. In other words, the multiplexers at each input of the Min_Finder block choose which one of the four streams of data should be fed into the Min_Finder. Therefore, the inputs to the Min_Finder come from level $i = 5$, 3 or 1 if $M_T$ is 2, 3 or 4, respectively; see Fig. 5. The $M_T$ input can change on the fly; thus, the design can shift from one mode to another based on the number of streams it is attempting to detect at any time. Moreover, as will be shown later, the configurability of the minimum finder guarantees that less latency is required for detecting a smaller number of streams.
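The level selection can be expressed as a one-line rule. The sketch below assumes the maximal design has $M = 8$ real-valued levels numbered from 8 down to 1 and that each stream occupies two consecutive levels. The constant and function names are hypothetical.

```python
M_MAX = 8  # real-valued tree levels in the maximal 4-stream design


def min_finder_input_level(num_streams):
    """Tree level whose outputs feed the Min_Finder for a given M_T."""
    if num_streams not in (2, 3, 4):
        raise ValueError("supported configurations are M_T = 2, 3 or 4")
    # Each stream uses two levels, counted down from level M_MAX.
    return M_MAX - 2 * num_streams + 1


selected = {m_t: min_finder_input_level(m_t) for m_t in (2, 3, 4)}
```

With fewer streams the search stops at a higher-numbered level, so the result is available after fewer pipeline stages.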
Modulation Order
In order to support different modulation orders per data stream, the Flex-Sphere uses another input control signal, $q^{(i)}$, to determine the maximum real value of the modulation order of the $i$-th level. Thus, $q^{(i)} \in \{1, 3, 7\}$. Moreover, since the modulation order of each level is changing, a simple comparison-thresholding cannot be used to find the closest candidate for Schnorr-Euchner [14] ordering. Therefore, the following conversion is used to find the closest SE candidate:
where $[\cdot]$ represents rounding to the nearest integer, $b = (1/R_{ii}) \cdot b_{i+1}$ of Eq. 7, and $g(\cdot)$ is
All of these functions can be readily implemented using the available building blocks of the Xilinx System Generator; see Fig. 6. Note that the multiplications/divisions are simple one-bit shifts.

Figure 6: The pipelined System Generator block diagram for Eq. 11 in the $\mathrm{PED}_g$ to support different modulation orders.
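One way such a conversion could work is sketched below. It is an assumption about the form of the mapping, chosen to be consistent with the description: round to the nearest odd integer using one halving and one doubling, then saturate to the range allowed by $q^{(i)}$. The function names are hypothetical.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with ties going upward."""
    return math.floor(x + 0.5)


def saturate(x, q):
    """Clamp x to the interval [-q, q] (assumed role of g)."""
    return max(-q, min(q, x))


def closest_candidate(b, q):
    """Nearest odd constellation value to b, limited to {-q, ..., -1, 1, ..., q}."""
    nearest_odd = 2 * round_half_up((b + 1) / 2) - 1  # halving and doubling are 1-bit shifts
    return saturate(nearest_odd, q)


examples = [
    closest_candidate(2.3, 7),    # inside the 64-QAM range
    closest_candidate(9.4, 7),    # beyond the range, saturates
    closest_candidate(-4.6, 3),   # 16-QAM level, saturates
    closest_candidate(-0.2, 1),   # 4-QAM level
]
```

Because only `q` changes between modulation orders, the same datapath serves 4-, 16- and 64-QAM without separate threshold tables.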
J Sign Process Syst
For the first two levels, which correspond to the in-phase and quadrature components of the last antenna, the PEDs of the out-of-range candidates are simply overwritten with the maximum value; thus, they are automatically discarded during the minimum-finding procedure.
Modified Real Valued Decomposition (M-RVD)
Using the real-valued decomposition, the two extra adders that are required for each complex multiplication can be avoided, thus saving the FPGA slices that these additions would otherwise occupy. Moreover, while complex-valued operations require the SE ordering of [4], which would be a demanding task given the configurable nature of the detector, with the real-valued decomposition the SE ordering can be implemented more efficiently and simply for the proposed configurable architecture, as described earlier. Also, note that even though some of the multiplications can be replaced with shift-adds in an area-optimized ASIC design, as discussed in Section 4, for an FPGA implementation the appropriate design choice is to use the available embedded multipliers, commonly known as XtremeDSP and DSP48E in Virtex-4 and Virtex-5 devices.
It is noteworthy that if the conventional real-valued decomposition of Eq. 3 were employed, then the results for a 2 × 2 system would be ready only after going through all the in-phase tree levels and the first two quadrature levels. However, with the modified real-valued decomposition (M-RVD), every antenna is isolated from the other antennas in two consecutive levels of the tree. Therefore, there is no need to go through the latency of the unnecessary levels. Thus, using the M-RVD technique offers a latency reduction compared to the conventional real-valued decomposition.
Timing Analysis
Each of the $\mathrm{PED}_g$ blocks is responsible for expanding 8 nodes; thus, the folding factor of the design is $F = 8$. In order to ensure a high maximum clock frequency, several pipelining levels are introduced inside each of the PED computation blocks. The latencies of the $\mathrm{PED}_1$, $\mathrm{PED}_2$ and $\mathrm{PED}_g$ blocks are 7, 17 and 22 cycles, respectively. Note that the larger latency of the $\mathrm{PED}_g$ blocks is due to the larger number of multiplications required to compute the PEDs of the later levels. The Min_Finder block has a latency of 8 cycles.
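The effect of the folding factor on throughput can be illustrated with a rough estimate. The sketch assumes a fully pipelined design that accepts one received vector every $F$ clock cycles, so that the data rate is the number of bits per vector times the clock frequency divided by $F$. This relation and the 100 MHz clock are illustrative assumptions, not figures from the implementation.

```python
import math


def data_rate_bps(num_streams, constellation_size, clock_hz, folding_factor=8):
    """Estimated throughput, assuming one detected vector every
    folding_factor clock cycles."""
    bits_per_vector = num_streams * math.log2(constellation_size)
    return bits_per_vector * clock_hz / folding_factor


HYPOTHETICAL_CLOCK_HZ = 100e6  # placeholder clock, not a reported value
rate_4x4_64qam = data_rate_bps(4, 64, HYPOTHETICAL_CLOCK_HZ)
rate_2x2_4qam = data_rate_bps(2, 4, HYPOTHETICAL_CLOCK_HZ)
```

Under this assumption, four 64-QAM streams with $F = 8$ deliver three bits per clock cycle, so the achievable rate scales directly with the maximum clock frequency.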
$M_T$ Latency
As mentioned earlier, different values of M T require different number of tree levels, which incurs different latencies. The latencies of the three different configurations of M T are presented in Table 2 . In computing the latencies, an initial 8 cycles are required to fill up the pipeline path. 
Implementation Results for $M_T = 4$
This table also presents the implementation results of a previously reported 64-QAM, 4 × 4 system [18]. While the proposed Flex-Sphere is implemented on a different FPGA device due to its relatively larger size, it can support different numbers of antennas and modulation orders, and achieves the high data rate requirements of various wireless standards. Table 5 summarizes the data rates for all of the different scenarios of the $M_T = 4$, Virtex-5 implementation. Note that the channel pre-processing of [19] is employed to improve the performance.
Simulation Results
In this section, we present the simulation results for the Flex-Sphere, and compare the performance of the FPGA fixed-point implementation with that of the optimum floating-point maximum-likelihood (ML) results. Prior to the M-RVD, introduced in Section 3, we employ the channel ordering of [19] to further close the gap to ML. Also, we make the assumption that all the streams are using the same modulation scheme. We assume a Rayleigh fading channel model, i.e. complex-valued channel matrices with the real and imaginary parts of each element drawn from the normal distribution.
In order to ensure that all the antennas in the receiver have similar average received SNR, and that none of the users' messages are suppressed by other messages, a power control scheme is employed. Figure 8 shows the simulation results for the maximal 4 × 4 configuration. As can be seen, the proposed hardware architecture implementation performs within, at most, 1 dB of the optimum maximum-likelihood detection.
Conclusion and Future Work
In this paper, we presented a configurable architecture for multi-user MIMO detection, which can support, in real time, the different numbers of antennas and modulation orders required by a wide variety of standards. The proposed architecture enhances the performance of SDR handsets for next-generation wireless standards. We also presented the FPGA implementation results of the 3 × 3 and 4 × 4 configurations, and the simulation results suggest that the performance can be made close to that of the optimum ML detector. It is worth noting that even though the presented results are for hard detection, they can be readily extended to support configurable soft detection scenarios, required for soft iterative detection-decoding schemes [3]. This can be achieved by developing a configurable soft computation block that uses the list of the symbols of the last level for computing the soft information. Comparing the performance of this soft detection strategy with other soft detection strategies forms the next step of the work.
