We propose a carefully selected receiver structure, detector and detector implementation architecture for a multiple-input multiple-output (MIMO) uplink base station receiver in fourth generation (4G) wireless cellular systems. First, we compare different receiver algorithms and structures for single-carrier frequency-division multiple access (SC-FDMA) uplink transmission to understand the performance and complexity of these algorithms and their suitability for practical realization. One of those structures, namely frequency domain MMSE equalization with sphere detection (SD), is proposed for implementation. The receiver consists of separate stages for inter-symbol interference (ISI) and inter-antenna interference (IAI) mitigation in frequency selective MIMO channels. Frame error rate (FER) performance is studied via simulations with realistic wireless channels and practical system parameters. K-best SD is selected as the detector algorithm for this receiver. Several publications propose a sort-free architecture for tree search type detectors. Both a conventional K-best architecture and a sort-free architecture are implemented on a Xilinx Virtex-6 field-programmable gate array (FPGA) using a High Level Synthesis (HLS) tool. Both architectures support 4×4 MIMO with 64-level modulation (64-QAM). Complexity results confirm that avoiding the sorter is not always recommended: the benefit of the sort-free architecture depends on the system parameters.
Introduction
Third generation partnership project (3GPP) Long Term Evolution (LTE) [1] uses single-carrier frequency-division multiple access (SC-FDMA) as the uplink transmission scheme [2]. The same is true for its further evolution LTE-Advanced (LTE-A) [3]. SC-FDMA has been selected in the uplink instead of orthogonal frequency division multiplexing (OFDM) mainly because of its reduced peak-to-average power ratio, which reduces the mobile transmitter cost by allowing cheaper power amplifiers [4]. A consequence of the SC transmission is that inter-symbol interference (ISI) is unavoidably introduced and an equalizer is needed in the receiver. Frequency domain (FD) linear minimum mean square error (MMSE) receivers have been studied extensively for single carrier transmission [5, 6] and are probably still the most predominant in practical realizations. More advanced turbo based receivers have also been considered in basic research [7] [8] [9] [10], but their complexity is still typically too high for most commercial products.
Multiple-input multiple-output (MIMO) communications [11, 12] has also been standardized for the LTE uplink to increase the peak data rates. The LTE-A standard specifies up to four transmit antennas in the user terminal. As in any spatial multiplexing based MIMO transmission, inter-antenna interference (IAI) is induced and a tailored spatial equalizer is required in the receiver. Both linear and nonlinear receiver structures have been extensively considered for MIMO receivers, with an emphasis on ones operating in OFDM systems, wherein ISI is not a problem. In particular, different variants of the so-called sphere detector (SD) [13], which calculate the maximum likelihood (ML) solution with reduced complexity, have received a lot of attention in the literature. The list sphere detector (LSD) [14] is useful in practical systems employing forward error control (FEC) coding, because it approximates the maximum a posteriori probability (MAP) detector producing soft outputs for the channel decoder. Implementations of different LSD versions and other tree-search algorithms have been considered earlier, mostly in the downlink MIMO-OFDM context, in [15] [16] [17] [18] [19] [20].
The introduction of the spatial multiplexing based MIMO concept to the LTE-A uplink means that the base station receiver encounters further challenges. It must cope with both frequency-selectivity induced ISI and spatial domain IAI. This calls for joint consideration of both problems. The algorithm complexity will unavoidably increase and the realization also becomes a major challenge. The stringent real time and latency requirements of receiver processing make it necessary to perform joint algorithm and architecture optimization. The most conventional MIMO receiver structure consists of the frequency domain linear MMSE equalizer optimized for both ISI and IAI. It performs reasonably well, but suffers a significant error rate increase for large numbers of antennas [21, 22]. A promising way to improve its error rate is to apply separate stages for ISI and IAI mitigation [23]; a similar idea has earlier been applied, e.g., in [24] in the wideband code-division multiple-access (WCDMA) context. The MMSE filter can first be applied to suppress the ISI as would be done in conventional single-input single-output SC-FDMA communications. Another, in general nonlinear, equalizer stage is then used for MIMO detection, i.e., equalizing the IAI between the spatial streams.
Our objective is to propose an efficient receiver structure, SD algorithm and SD implementation architecture for real LTE-A systems and their SC-FDMA based uplink base stations. In [23], we proposed a new receiver structure, a frequency domain linear MMSE filter with sphere detection, for SC-FDMA uplink transmission and compared it to the conventional receiver structure based on frequency domain linear MMSE equalization with soft demodulation. Two different tree search algorithms were considered for the proposed structure, namely the K-best [25] LSD algorithm and the selective spanning with fast enumeration (SSFE) algorithm [26]. In this paper, we have revised the frame error rate (FER) performance and complexity analysis of these receivers. As a completely new result, the K-best LSD algorithm used for the sphere detection part of the proposed receiver structure is chosen for more detailed analysis. Two different architectures for this K-best LSD algorithm are implemented on a Xilinx field-programmable gate array (FPGA) using a High Level Synthesis (HLS) tool to get a good understanding of their complexity and to analyze whether the so-called sort-free architecture has any gain over a straightforward implementation using a common insertion sorter. The HLS tool enables the implementation and comparison of different architectures in a relatively short time. The architecture optimization can be done at the C language level, which gives a clear benefit in terms of design time and effort. The area efficiency could probably be further optimized with a traditional design approach, but HLS is well suited for this type of architecture evaluation.
System Model
A single carrier based vertically encoded MIMO transmission system with T transmit and R receive antennas is considered. The system model is presented in Figure 1. The encoded data stream is interleaved and modulated into symbols. After the parallel-to-serial conversion, a cyclic prefix (CP) is added. At the receiver, a K-point DFT is performed and the symbols from the allocated carriers are selected. After the frequency domain equalization, the symbols are transformed into time domain and the detector is used to calculate the bit log-likelihood ratios (LLR) for the decoder. After CP removal, the received signal vector $r \in \mathbb{C}^{RK}$ can be expressed as

$$r = Hx + v,$$

where $v \in \mathbb{C}^{RK}$ is independent and identically distributed circularly symmetric complex Gaussian noise with variance $\sigma^2$ and zero mean, $x \in \mathbb{C}^{TK}$ is the transmitted signal, $H \in \mathbb{C}^{RK \times TK}$ is the circulant block channel matrix and K is the length of the DFT. The channel matrix H can be written as

$$H = \begin{bmatrix} H_{1,1} & \cdots & H_{1,T} \\ \vdots & \ddots & \vdots \\ H_{R,1} & \cdots & H_{R,T} \end{bmatrix},$$

where $H_{r,t} \in \mathbb{C}^{K \times K}$ is the channel submatrix between the tth transmit and rth receive antenna.
SC-FDMA MIMO Receivers
The most conventional MIMO receiver structure consists of the frequency domain linear MMSE equalizer with a soft demodulator. The ISI and IAI terms are both counteracted by the same linear MMSE filter. The structure is illustrated in Figure 2 . The soft demodulator is used to calculate the log-likelihood ratios for the decoder. No further IAI suppression is performed in the soft demodulator as the LLRs are calculated separately for each stream. The structure is well known in the literature. It performs well but not optimally. One option to improve the performance of the conventional MMSE based MIMO receiver would be a time domain sphere detector with combined mitigation of ISI and IAI. The time domain channel matrix for the QR decomposition (QRD) would have dimensions of R × T × L, where L is the length of the channel and R and T are the number of receive and transmit antennas, respectively. This means that the complexity explodes when the number of antennas and channel length increase. The time domain sphere detector would in principle give good communication performance, but it would be too complex for most practical implementations in the 4×4 MIMO case with the processing power available with the current technology.
Another option to improve the performance of a MIMO receiver is the frequency domain MMSE filter with sphere detection. Therein, the ISI and IAI mitigation are performed in separate stages and the complexity is much lower than with time domain SD processing. This is the approach considered in the sequel. The MMSE filter is first applied to suppress the ISI as in conventional single-antenna SC-FDMA communications. Its operation can also be interpreted as a channel shortening filter, producing a shortened channel matrix for the sphere detector. The sphere detector is subsequently used for MIMO detection, i.e., removing the IAI between the spatial streams. Several different tree search algorithms could be used to perform the MIMO detection in this receiver structure. Here we consider the K-best LSD and SSFE algorithms as candidates for our implementation. The receiver structure for vertically encoded R × T MIMO is illustrated in Figure 3. The MMSE filter and the two tree search algorithms are described in more detail below.
MMSE Filter
The linear MMSE filter coefficients are derived to cancel ISI, and the filter coefficients $\Omega \in \mathbb{C}^{RK \times RK}$ can be determined according to the following criterion [10]
where e is the mean square error (MSE), the expectation E{·} is taken with respect to x and v, H ∈ $\mathbb{C}^{RK \times TK}$ is the target channel matrix and it consists of submatrices diag($H_{r,t}$), i.e., the diagonal elements (1st channel
is the matrix trace operator and ⊗ is the Kronecker product. The MMSE filter can be written as
where $\Sigma_r = \Gamma\Gamma^H + \sigma^2 I \in \mathbb{C}^{RK \times RK}$, $I \in \mathbb{R}^{RK \times RK}$ is an identity matrix and the frequency domain channel matrix is $\Gamma = F_R H F_T^{-1} \in \mathbb{C}^{RK \times TK}$. The (i, j) term of the equivalent channel $\Phi \in \mathbb{C}^{R \times T}$ can be calculated as
where i = 1, ..., R and j = 1, ..., T, and the (i, j) term of the covariance of residual interference $\Sigma_w \in \mathbb{C}^{R \times T}$ as
The equalized signal $z \in \mathbb{C}^{RK}$ after the IDFT can be written as
After the frequency domain filtering, the noise is not white and has the covariance matrix $\Sigma_w$ from (5). The likelihood function term $\frac{1}{\sigma^2}\|z - \Phi s\|_2^2$ then becomes $\|\Sigma_w^{-1/2}(z - \Phi s)\|_2^2$, where s is a transmitted symbol vector candidate. The covariance of residual interference can be taken into account either by whitening the noise or by including it in the distance calculations. The whitening can be done by multiplying z and Φ with the inverse square root of the covariance matrix $\Sigma_w$, i.e., $z_w = \Sigma_w^{-1/2} z$ and $\Phi_w = \Sigma_w^{-1/2} \Phi$, where $\Sigma_w^{-1/2} = U \Lambda^{-1/2} U^H$ when $\Sigma_w = U \Lambda U^H$, and Λ contains the eigenvalues and U contains the eigenvectors of $\Sigma_w$.
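That this multiplication indeed whitens the noise can be checked in one line. The following is a short derivation under the assumption that w denotes the residual interference plus noise contained in z, with covariance $\mathrm{E}\{w w^H\} = \Sigma_w$ (the symbol w is introduced here only for illustration):

```latex
\begin{align}
\Sigma_w^{-1/2} &= U \Lambda^{-1/2} U^{H}, \\
\mathrm{E}\{(\Sigma_w^{-1/2} w)(\Sigma_w^{-1/2} w)^{H}\}
  &= \Sigma_w^{-1/2}\, \Sigma_w\, \Sigma_w^{-1/2}
   = U \Lambda^{-1/2} \Lambda \Lambda^{-1/2} U^{H}
   = I .
\end{align}
```

The second line uses the fact that $\Sigma_w^{-1/2}$ is Hermitian and that $U^H U = I$, so the plain Euclidean distance can be used in the tree search after whitening.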
Sphere detector
The structure of the sphere detector is presented in Figure 
The tree search is then performed separately for each whitened symbol vector. The squared partial Euclidean distance (PED) of $s_i^T$, i.e., the square of the distance between the partial candidate symbol vector and the partial received vector, can be calculated in the sphere detector block as
where i = T, ..., 1 and $s_i^T$ denotes the last T − i + 1 components of vector s [13].
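The PED accumulation can be illustrated with a small software model. This is a minimal sketch, assuming a real-valued system model, an upper triangular matrix R obtained from a QR decomposition of the whitened channel, and a rotated received vector y; the function name and argument layout are hypothetical and are not taken from the implementation described in this paper.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: accumulates the squared partial Euclidean distance
// from the last tree level (row T-1, zero-based) up to row `first_row`.
// R is assumed to be upper triangular (from a QR decomposition), y is the
// rotated received vector and s is the candidate symbol vector.
double partial_euclidean_distance(const std::vector<std::vector<double>>& R,
                                  const std::vector<double>& y,
                                  const std::vector<double>& s,
                                  std::size_t first_row) {
    const std::size_t T = s.size();
    double ped = 0.0;
    // Walk the rows bottom-up, as the tree search does.
    for (std::size_t i = T; i-- > first_row;) {
        double residual = y[i];
        for (std::size_t j = i; j < T; ++j) {
            residual -= R[i][j] * s[j];  // only already decided symbols appear
        }
        ped += residual * residual;
    }
    return ped;
}
```

Calling the function with `first_row` equal to the last row index gives the first-level PED, and decreasing `first_row` adds one PED increment per tree level.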
The resulting list of candidate symbol vectors L is demapped into binary form and the LLR for the transmitted bit k is calculated as
where
and Θ is the set of possible transmitted symbol vectors. The LLRs can be updated from the decoder feedback $L_A$ as
is a vector corresponding to k from the transmitted binary vector b.

Two different tree search algorithms are considered in this paper. The K-best LSD algorithm [25] is a breadth-first search based algorithm, which keeps the K nodes which have the smallest accumulated Euclidean distances at each level. If the PED is larger than the squared sphere radius $C_0$, the corresponding node will not be expanded. We assume no sphere constraint, i.e., $C_0 = \infty$, but set the value for the list size K instead, as is common with K-best algorithms. Figure 5 illustrates the K-best tree search structure for a real-valued 2 × 2 antenna system using 16-QAM and list size 4. In a complex-valued system there would be only two levels, but on each level the parent node would be expanded into 16 nodes. Figure 6 shows the tree search structure for the same system.

The slicer unit is an essential part of the SSFE algorithm. It selects a set of closest constellation points $s_i$ such that the PED increment is minimized at each level, e.g.,
Minimizing
Equation (13) is essential for the slicer unit which selects the closest constellation points based on ε.
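The slicing operation itself can be sketched in a few lines. The example below is an illustration only, assuming a real-valued M-level alphabet of odd integers (M = 8 for one real dimension of 64-QAM) and an unconstrained point estimate as input; the function name is hypothetical and the sketch does not reproduce the exact ε-based formulation referred to above.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical slicer for an M-level real-valued alphabet
// {-(M-1), ..., -3, -1, 1, 3, ..., M-1}.
// Returns the alphabet point closest to `estimate`.
int slice_to_pam(double estimate, int levels) {
    // Nearest odd integer to the estimate: 2*floor(x/2) + 1.
    const int nearest_odd =
        2 * static_cast<int>(std::floor(estimate / 2.0)) + 1;
    // Clamp to the outermost constellation points.
    return std::max(-(levels - 1), std::min(levels - 1, nearest_odd));
}
```

Further candidates, when more than one node is spanned on a level, would be taken from the neighbours of the sliced point in order of increasing distance.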
Communications Performance
The performances of the conventional MMSE receiver and the frequency domain MMSE filter with two different tree search algorithms were compared in Matlab simulations. K-best LSD algorithm was simulated with list size 8 Table 1. Pedestrian A, Vehicular A and Pedestrian B channel models were used in the simulations [27]. The channel parameters are described in Table 2. As can be seen from the multipath profile values, the Pedestrian A channel is the least and the Pedestrian B channel the most frequency-selective, causing a strong ISI term. The chosen azimuth spread values result in spatially correlated channels, making the case both realistic and very challenging for the MIMO equalizer. The number of transmit and receive antennas is four. This illustrates the most challenging case of high data rate and significant IAI. A true synchronization process is performed in the simulator. However, the maximum uncertainty for time synchronization is set low enough to virtually eliminate the effect of synchronization error. The effect of synchronization error has not been studied in this paper. Figure 8 and in a correlated Pedestrian B channel in Figure 9. The performances in 2 × 2 MIMO and 1 × 4 (virtual MIMO) scenarios were also simulated, but the results are not reported herein, because they are essentially similar to those for the 4 × 4 case. All the simulations were also performed with code rate 1/2. The results followed the code rate 2/3 results, but the performance differences were slightly smaller.
In [23] we also simulated turbo receiver performances, but those results had much more variance. Although the turbo receiver was superior with a limited set of parameters (code rate, ISI and IAI), there were many scenarios where the receiver did not converge even with a high number of iterations. In summary, the turbo receiver improved the performance especially in scenarios where there was no IAI. Therefore turbo receivers might be suitable for OFDM systems, wherein ISI is not a problem. Additionally, a full analysis of whether the latencies of turbo structures could be accepted in an LTE Rel. 8 to Rel. 10 uplink receiver implementation would require further study.
The simulation results show that the 2-stage receiver with a time domain sphere detector using K-best algorithm (8-best, 16-best) implementation in which good performance has been achieved with pre-processing. However, the main computational complexity of this implementation is in the antenna ordering, the pre-processor being 3.3 times more complex than the actual SD. Based on the simulations, 8-best, 16-best and SSFE[4,3,2,2,1,1,1,1] would all be suitable for the 2-stage receiver implementation without any additional pre-processing. However, our complexity estimation results for these algorithms [23] show that the 16-best algorithm would be twice as complex and the SSFE[4,3,2,2,1,1,1,1] algorithm would be 50% more complex than the 8-best algorithm. As a result, the K-best list sphere detector with a list size of 8 was selected for FPGA implementation. It offers the best performance-complexity ratio for practical implementation in the channel conditions studied herein.
Development environment
An HLS tool was used to generate the RTL instead of writing the RTL by hand. HLS tools are gaining popularity and they are challenging the traditional design approach. Several studies show that these tools increase the design productivity and reduce the development time, while producing competitive quality of results compared to hand-written RTL [29, 30].
The implementation tool flow can be seen in Figure 10. The algorithms were first written in Matlab to enable comprehensive simulations and comparisons in our SC-FDMA Matlab link level simulator. The selected HLS tool uses C code as a source. Thus, C versions of the selected algorithms had to be written. The MEX interface in Matlab enabled us to verify that the C version was identical to the original Matlab version.
The Xilinx Vivado HLS tool was used for converting the C code into RTL (in this case VHDL). The tool gives a new abstraction level and hides some of the complexity of design implementation. The HLS tool generates a high-performance pipelined architecture based on the constraints, directives and implementation C/C++ code. The constraints include, for example, the target FPGA family and the target clock frequency. The directives guide the HLS tool, for example, to unroll loops or partition arrays. The input is not the original reference C/C++ code. Instead, the reference code has been restructured so that it represents an architecture targeted by the designer. Figure 11 shows the iterative code restructuring phase of the design flow. The HLS tool generates the RTL output based on these inputs and reports the throughput performance and an estimate of the complexity of the architecture. The designer can then iteratively change the directives and the C/C++ source code until the throughput requirements are satisfied. With an HLS tool it is possible to generate a valid high-complexity solution in a relatively short time, but a highly optimized low-complexity solution requires many iterations. The iterative design approach enables a trade-off between the quality of results and the development time.
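To make the role of directives concrete, the following sketch shows loop unrolling and array partitioning applied to a small kernel. It is an illustration under assumptions: the kernel, its name and its array length are hypothetical, and the directive spelling (`#pragma HLS ...`) is assumed from the Vivado HLS documentation, whereas the listings later in this paper use the older `#pragma AP` spelling. A standard C++ compiler ignores the pragmas, possibly with a warning, so the function can also be verified in software.

```cpp
#include <cstdint>

constexpr int kLen = 8;  // hypothetical fixed array length

// Hypothetical kernel: sum of squared differences over two small arrays.
std::int32_t squared_distance(const std::int16_t a[kLen],
                              const std::int16_t b[kLen]) {
// Split both arrays into individual registers so that all elements
// can be read in the same clock cycle.
#pragma HLS ARRAY_PARTITION variable=a complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1
    std::int32_t acc = 0;
    for (int i = 0; i < kLen; ++i) {
// Replicate the loop body kLen times instead of iterating sequentially.
#pragma HLS UNROLL
        const std::int32_t d = static_cast<std::int32_t>(a[i]) - b[i];
        acc += d * d;
    }
    return acc;
}
```

Removing the two directives leaves the C++ behaviour unchanged but would let the tool schedule the loop sequentially, which trades throughput for area.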
In the next phase, the output RTL is used as an input for the FPGA implementation tool (Xilinx ISE/EDK). The final achievable clock frequency and resource usage are reported after logic synthesis and Place & Route. If the results do not satisfy the designer, the directives or the implementation C/C++ code can be modified further.
Implementation
The K-best LSD algorithm was implemented on a Virtex-6 XC6VLX240T FPGA with speed grade -2. Implementation started with the requirement specification and input/output (I/O) specification. After that, an initial architecture was planned. The Matlab model of the 8-best LSD algorithm was rewritten in C. The C code was verified again after every modification. The HLS tool made it possible to generate several different solutions and choose the best one. The throughput of 347 Mbps can be achieved, for example, with the following parameter combinations: 115 MHz/8 cycles, 230 MHz/16 cycles or 460 MHz/32 cycles. 460 MHz is somewhat too high a frequency target for an FPGA and 115 MHz is a very loose one. Scheduling the design in 8 clock cycles wastes resources, because a higher frequency could be achieved. We decided to implement two different architectures which both achieve 347 Mbps: Architecture I, including a challenging sorting operation in the tree search, and Architecture II without a sorting operation. Both architectures were implemented with a similar amount of optimization.
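The relation between clock frequency, scheduling interval and throughput can be written out as a small helper. This is a sketch under the assumption that each detected symbol vector carries 4 × 6 = 24 bits (four spatial streams with 64-QAM) and that one vector is accepted every `cycles_per_vector` clock cycles; the function name is hypothetical.

```cpp
// Detection rate in Mbps for a detector that accepts one symbol vector
// every `cycles_per_vector` clock cycles (assumed model, see lead-in).
double detection_rate_mbps(double clock_mhz, int cycles_per_vector,
                           int streams, int bits_per_symbol) {
    return clock_mhz * streams * bits_per_symbol / cycles_per_vector;
}
```

With these assumptions, each of the three listed combinations evaluates to 345 Mbps, which agrees with the 347 Mbps target up to rounding of the clock frequency.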
Implementation requirements

Macroarchitecture Specification
Architecture I
Architecture I describes the structure of the K-best algorithm without trying to avoid the sorting operation. Eight PEDs are calculated on the first level. On each of the levels 2-8, every one of the 8 surviving nodes is expanded into 8 children, resulting in 64 PEDs. These 64 PEDs need to be sorted, leaving 8 surviving PEDs. Levels 2-8 therefore need a sorter. Sorting N samples requires N operations if there is no prior information about the samples. With synchronous logic this means that a level including a sorter cannot be scheduled in fewer than 64 cycles (pipeline initiation interval ≥ 64). The sorter selected here is the insertion sorter. For large data sets, asymptotically efficient sorters like quicksort, heap sort and merge sort could be used. However, 64 samples can be considered a relatively small data set, and insertion sort is one of the most common and efficient sorting algorithms available for small data sets.
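The behaviour of such a sorter can be modelled in software as follows. This is a minimal sketch, assuming that one PED arrives per call (mirroring one sample per clock cycle) and that only the 8 smallest values need to be retained; the type and member names are hypothetical and the sketch is not the HLS source used in this work.

```cpp
#include <array>
#include <cstddef>
#include <limits>

constexpr std::size_t kListSize = 8;  // K of the K-best algorithm

// Keeps the kListSize smallest values seen so far in ascending order.
struct InsertionSorter {
    std::array<double, kListSize> best;

    InsertionSorter() { best.fill(std::numeric_limits<double>::infinity()); }

    // One call per incoming PED.
    void insert(double ped) {
        if (ped >= best[kListSize - 1]) return;  // not among the survivors
        std::size_t pos = kListSize - 1;
        while (pos > 0 && best[pos - 1] > ped) {
            best[pos] = best[pos - 1];  // shift larger entries towards the end
            --pos;
        }
        best[pos] = ped;
    }
};
```

Feeding all 64 PEDs of one level through `insert` takes 64 calls, which corresponds to the initiation interval of 64 cycles stated above.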
The macroarchitecture shown in Figure 12 was designed for the Architecture I. PED 1 calculates 8 distances and does not require sorting. PEDs 2-8 calculate 64 distances and include an insertion sorter.
Architecture II
An alternative Architecture II, in which the goal was to avoid the sorting operation, was also implemented. In Architecture I, all the 64 PEDs are sorted and then the 8 smallest ones are selected, whereas in Architecture II the 8 smallest ones are selected directly. For the sort-free architecture we use the method of [31], [32] and [33]. It is possible to find K smallest PEDs in less than K cycles, if we use regularities of constellation points and pre-sorted PEDs. The slicing operation, used in Schnorr-Euchner enumeration, is used to find the smallest child of a parent node. Next, a min-search can be used to find the smallest among the pre-ordered children of the different parents. Figure 13 [33] shows the idea of the sort-free method for the K-best algorithm. The key idea of the distributed K-best scheme is to find the first child of each node in $K_{l+1}$. Among these first children, the one with the lowest PED is definitely one of the K best candidates in $K_l$. That child is selected and is replaced by its next best sibling. This process is repeated K times to find the K best candidates in level l ($K_l$). This structure finds the K best candidates in just K clock cycles. Figure 14 shows the planned macroarchitecture for the alternative structure, which we here call Architecture II.
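The selection loop of the sort-free method can be sketched as a software model. The sketch assumes that the children of each parent are already available in ascending PED order, as an enumeration based on slicing would deliver them; the struct, the function name and the data layout are hypothetical and only illustrate the principle described in [33].

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Candidate {
    std::size_t parent;      // index of the parent node
    std::size_t child_rank;  // 0 = best child of that parent
    double ped;
};

// children_ped[p] holds the PEDs of parent p's children in ascending order
// (assumption: delivered in that order by the enumeration).
std::vector<Candidate> select_k_best(
        const std::vector<std::vector<double>>& children_ped, std::size_t k) {
    std::vector<std::size_t> next(children_ped.size(), 0);
    std::vector<Candidate> selected;
    for (std::size_t round = 0; round < k; ++round) {
        // Min-search over the current best unselected child of every parent.
        std::size_t best_parent = children_ped.size();
        double best_ped = std::numeric_limits<double>::infinity();
        for (std::size_t p = 0; p < children_ped.size(); ++p) {
            if (next[p] < children_ped[p].size() &&
                children_ped[p][next[p]] < best_ped) {
                best_ped = children_ped[p][next[p]];
                best_parent = p;
            }
        }
        if (best_parent == children_ped.size()) break;  // no children left
        selected.push_back({best_parent, next[best_parent], best_ped});
        ++next[best_parent];  // replace the winner by its next best sibling
    }
    return selected;
}
```

Each round corresponds to one clock cycle in the hardware structure, provided the min-search over the parents is performed in parallel.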
Parameterization
Parameterization was used in the rewriting of the C code. Example below shows the C++ template function for levels PED 2-8 (PED 1 has its own function). Kbest LSD 8 gets the level of the tree search as a template parameter. Parameterization allows the HLS tool to use more resource sharing and in that way reduce the FPGA resource usage.
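How a tree level can be passed as a template parameter is sketched below. The function is a hypothetical illustration, not the template function of our implementation: it assumes a real-valued model with 8 tree levels, one row of an upper triangular matrix as input, and levels counted from 1 at the top of the tree.

```cpp
#include <array>

constexpr int kLevels = 8;  // tree levels in the real-valued 4x4 model

// Hypothetical level-parameterised stage. LEVEL fixes how many symbols
// are already decided, so the loop bound is a compile-time constant.
template <int LEVEL>
double ped_increment(const std::array<double, kLevels>& r_row,
                     const std::array<double, kLevels>& s,
                     double y) {
    static_assert(LEVEL >= 1 && LEVEL <= kLevels, "LEVEL out of range");
    double residual = y;
    // Row index of this level is kLevels - LEVEL; only columns from the
    // diagonal to the right are non-zero in an upper triangular matrix.
    for (int j = kLevels - LEVEL; j < kLevels; ++j) {
        residual -= r_row[j] * s[j];
    }
    return residual * residual;
}
```

Because the loop bound depends only on `LEVEL`, a synthesis tool can size the datapath of each level individually while the source code is written once.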
FPGA optimization
Two examples of FPGA optimization used in these implementations are bit-width optimization and efficient use of embedded DSP blocks. The reference C/C++ code uses normal C/C++ data types (e.g. short, int). To optimize the bit-widths, fixed-point data types are used in the implementation C/C++ code. Arbitrary bit-widths are required so that, for example, 32-bit multipliers are not wasted when fewer bits are required. The ap_int.h header enables arbitrary precision integers and ap_fixed.h enables arbitrary precision fixed-point data types to be used in the C code.
A specific example of efficient use of embedded DSP blocks is the usage of DSP48s. The use of DSP48s improves timing and FPGA resource utilization significantly. The structure of the DSP48 block is shown in Figure 15. Here is an example of a multiplication followed by an addition. These two operations can be mapped into a single DSP48 block. The modified function call looks like this

    /* Multiplying candidate symbol with inde... */
        ap_int<25> ia = (ap_int<25>)A.range(24, 0);
        ap_int<18> ib = (ap_int<18>)B.range(17, 0);
        ap_int<48> ic = (ap_int<48>)C.range(47, 0);
        ap_int<48> id = multadd25x18<ADDSUB>(ia, ib, ic);
        ap_fixed<48, iwidth_a + iwidth_b + 5> r;
        r.range(47, 0) = id.range(47, 0);
        return r;
    }

Finally, template function macc25x18 calls multadd25x18 to perform the actual calculation. Two directives in multadd25x18 instruct the high-level synthesis tool to use a maximum of two cycles to schedule these operations and use a register for the output return value.

    // ********** multadd25x18 **********
    template<bool ADDSUB>
    ap_int<48> multadd25x18(ap_int<25> A, ap_int<18> B, ap_int<48> C) {
    #pragma AP INTERFACE ap_none port=return register
    #pragma AP LATENCY max=2
        if (ADDSUB) return C + A * B;
        else return C - A * B;
    }
Implementation results
The Architecture I, including insertion sorter, schedules in 64 cycles, which means that every 64th cycle a new input vector y is taken in, where y includes one symbol 
However, several parallel blocks can be used. A 20 MHz SC-FDMA transmission consists of 1200 subcarriers. The same channel matrix is used while 1200 data symbols from each of the four antennas are processed. One data symbol from each of these four antennas is used at a time in SD processing. All the 1200 data symbols are received during the same symbol period of 83 µs. One data symbol occupies the full bandwidth for 1/1200 of the symbol period, and the channel matrix is the same during this 83 µs period. In order to achieve the required 347 Mbps, four detectors can be used. Each of these processes 300 data symbols from four antennas. This leads to a total detection rate of 372 Mbps. In case of short cyclic prefix, five detectors can be used. The sort-free Architecture II schedules in 16 cycles and achieves a 231 MHz clock frequency. Therefore, a single Architecture II detector is enough to achieve the throughput of 347 Mbps. Architectures I and II are compared in Table 3. Four Architecture I blocks in parallel achieve the target throughput with fewer resources than the sort-free Architecture II. Furthermore, the parallel structure makes the design scalable. According to [31], the architecture without a traditional sorter for the K-best algorithm is adequate when K is smaller than the number of constellation points. Here K is 8 and the number of constellation points in a real-valued 64-QAM system is also 8. Thus, our system parameters create somewhat of a borderline case for the comparison.
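The throughput requirement and the number of parallel detectors follow from simple arithmetic, which the sketch below writes out. It assumes 6 bits per 64-QAM symbol and uses the 93 Mbps detection rate of a single Architecture I block; the function names are hypothetical.

```cpp
#include <cmath>

// Required detection rate in Mbps when `subcarriers` symbol vectors, each
// carrying streams * bits_per_symbol bits, arrive every symbol period.
// Bits per microsecond equal Mbps.
double required_rate_mbps(int subcarriers, int streams, int bits_per_symbol,
                          double symbol_period_us) {
    return subcarriers * streams * bits_per_symbol / symbol_period_us;
}

// Smallest number of identical detectors that meets the required rate.
int detectors_needed(double required_mbps, double per_detector_mbps) {
    return static_cast<int>(std::ceil(required_mbps / per_detector_mbps));
}
```

For 1200 subcarriers, four streams and an 83 µs symbol period the first function gives approximately 347 Mbps, and with 93 Mbps per detector the second function gives four detectors, i.e., 4 × 93 = 372 Mbps in total.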
Regarding the complexity distribution of the discussed receiver structure, the following notes were made for the blocks shown in Figure 4. The QRD has very low throughput requirements compared to the SD. The QRD is performed only once while the SD tree search algorithm runs 1200 times in the meantime. This enables high latency/low complexity directives for the design, and in our HLS implementation the QRD complexity is only 15-20% of that of the 8-best SD. Matrix multiplication has low complexity compared to the tree search. The de-mapper complexity depends on the number of remaining ED paths after the last level of the tree search. The number of remaining paths in the 8-best and 16-best tree search algorithms is 8 and 16, respectively. Likewise, in the SSFE[8,8,1,1,1,1,1,1] and SSFE[4,3,2,2,1,1,1,1] tree search algorithms, the number of remaining paths is 64 and 48, respectively. Therefore, the de-mapper for SSFE[8,8,1,1,1,1,1,1] is almost eight times more complex than the de-mapper for the 8-best tree search algorithm. The same is true also for the LLR algorithm. In our 8-best scenario, the de-mapper and LLR complexity is less than 10% of that of the SD algorithm.
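The path counts quoted for the SSFE variants are the products of the entries of the node spanning vector, which can be verified with a short helper (the function name is hypothetical):

```cpp
#include <vector>

// Number of candidate paths left after the last tree level: the product
// of the per-level node spanning factors.
int remaining_paths(const std::vector<int>& spanning) {
    int paths = 1;
    for (int m : spanning) {
        paths *= m;
    }
    return paths;
}
```

For the K-best algorithm no such product is needed, since the list is pruned back to K paths on every level.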
Discussion
We compared different receiver algorithms and structures for SC-FDMA uplink transmission. The novel receiver with frequency domain MMSE equalization and sphere detection is a remarkable improvement over the conventional linear MMSE receiver. The K-best LSD algorithm and the SSFE algorithm were considered as possible tree search algorithms for this receiver. Two different list sizes were used for the K-best algorithm and two different node spanning vectors for the SSFE algorithm. As a result, the K-best algorithm with a list size of 8 was considered to give the best performance-complexity ratio and was chosen for the implementation. The 8-best LSD algorithm was implemented on a Xilinx Virtex-6 FPGA with the Xilinx Vivado High Level Synthesis tool using several optimization methods. The HLS implementation was used to compare two architectures with each other. The design process and the amount of effort used for both of these implementations were identical. Both architectures might have potential for further optimization with traditional design methods. Yet, we assume that the same conclusions about the benefits of both architectures could be made with a hand-written RTL implementation.
The target throughput was 347 Mbps. The HLS tool enabled us to implement two different, equally optimized architectures with a moderate amount of work. Architecture I, including a conventional sorter, schedules in 64 cycles and achieves a 93 Mbps detection rate. However, the SC-FDMA receiver allows parallel processing of the subcarriers, and four 93 Mbps designs can be used to achieve the throughput target of 347 Mbps. Architecture II, without a conventional sorter, schedules in 16 cycles and achieves 347 Mbps with one design. Hence, the equalization algorithms and their realizations on an FPGA fulfill both the latency and performance requirements of LTE/LTE-A base stations with a 64-QAM and 4 × 4 MIMO set-up. The recommendations for the most suitable algorithm and architecture were made based on performance and implementation complexity. Because this is an uplink scenario and the RF front-end dominates the base station energy consumption, energy efficiency was not used as a factor.
The sort-free architecture did not give any gain and was actually more complex than the architecture including a sorter. The sort-free architecture would be efficient only if K is smaller than the number of constellation points. This is often the case when operating on complex-valued constellation points. In our system K = 8 and the number of constellation points was also 8. Based on the simulation results, K should be at least about 8 in a common 4 × 4 64-QAM system. There are basically two options to get gain from the sort-free architecture in a common 4 × 4 64-QAM system. Either the K value should be reduced to, e.g., 4, or a complex-valued tree search should be applied. In a 64-QAM system this would change the number of constellation points from 8 to 64. However, complex-valued processing would create other implementation challenges. The trade-off in performance caused by reducing the K value or applying the complex-valued tree search, as well as the determination of how much smaller K should be to gain a significant difference in complexity, would require further study.
Conclusion
Our objective was to give a recommendation for an efficient and realistic receiver structure, detector algorithm and detector implementation architecture for 4 × 4 64-QAM MIMO LTE-A systems and their SC-FDMA based uplink base stations. The frequency domain linear MMSE filter with sphere detection was chosen as the receiver structure based on our earlier work in [23]. The 8-best LSD algorithm was chosen as the detector algorithm based on the analysis done in this paper. With these system parameters, a sort-free implementation architecture for the 8-best LSD algorithm is not recommended. Thus, for practical realizations our recommendation is to focus on optimizing the conventional 8-best 4 × 4 64-QAM architecture without trying to avoid the sorting operation.
