Abstract: Soft-output detection of a multiple-input multiple-output (MIMO) signal poses a significant challenge in future wireless systems. In this paper, we introduce a soft-output modified metric-first list sphere detector (MMF-LSD) algorithm for MIMO detection. We design a scalable architecture and present a method to decrease memory requirements. We provide implementation results for a spatial multiplexing (SM) system with four transmitted streams and with 16- and 64-quadrature amplitude modulation (QAM) on a 0.18-μm CMOS application-specific integrated circuit (ASIC) technology. The MMF-LSD implementation is more efficient than the depth-first (DF) LSD in the crucial low signal-to-noise ratio (SNR) region, and the detection rate of the 64-QAM implementation is 39.2 Mbps at 26 dB with a complexity of 48.2 kGE.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) techniques in combination with orthogonal frequency-division multiplexing (MIMO-OFDM) have been identified as a promising approach for wideband systems with high spectral efficiency. The optimal maximum a posteriori (MAP) detector for a MIMO system with forward error correction (FEC) coding is often too complex for systems with high-order modulation. Suboptimal linear detectors [1] offer low-complexity solutions, but perform rather poorly in correlated fading channels. A list sphere detector (LSD) [2] is a soft-output variant of the sphere detector (SD) [3] that can be used to approximate the MAP detector with much lower computational complexity [2].
The SD algorithms are often divided according to their search strategy into breadth-first (BF), depth-first (DF), and metric-first (MF) algorithms [4]. The BF algorithms [5] are implementation friendly, but have suboptimal performance. The DF algorithms [6] visit fewer nodes than the BF algorithms, but they have a variable search complexity and are thus difficult to implement efficiently. The MF algorithms [4] are optimal in terms of the number of visited nodes, but they require that the visited nodes are maintained in metric order to ensure the optimality, which requires memory and sorting [4]. Various sphere detector designs and implementations have been introduced, e.g., in [5], [7]-[9].

In this paper, we introduce an implementation-friendly soft-output modified MF (MMF) LSD algorithm. We design an efficient and scalable architecture for the MMF-LSD and address the implementation tradeoffs. We provide a scalable implementation for a spatial multiplexing (SM) system with up to four transmitted streams and 16- and 64-quadrature amplitude modulation (QAM) constellations on a 0.18-μm CMOS application-specific integrated circuit (ASIC) technology. We present the synthesis and power results of the MMF-LSD implementation and compare it to a DF-LSD. As far as the authors know, no other architecture designs or implementations of MF-based detectors have been presented in the literature.

This paper is organized as follows. The signal model and the MMF-LSD algorithm are presented in Section II. The architecture is introduced in Section III, and the implementation tradeoffs are discussed in Section IV. The implementation results are introduced and discussed in Section V. The conclusions are drawn in Section VI.
II. MIMO SIGNAL DETECTION
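An OFDM-based SM system is considered with $N_T$ transmit (TX) antennas and $N_R$ receive (RX) antennas, with the assumption $N_R \geq N_T$ and with QAM. A real signal model is assumed with the real dimensions $M_T = 2N_T$ and $M_R = 2N_R$ and the real symbol alphabet $\Omega_R$. The received signal can be expressed in the real domain as [3]

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \boldsymbol{\eta} \tag{1}$$

where the received signal vector $\mathbf{y}$, the transmit symbol vector $\mathbf{x}$, and the noise vector $\boldsymbol{\eta}$ are defined in the frequency domain. With the QR decomposition (QRD) $\mathbf{H} = \mathbf{Q}\mathbf{R}$ of the channel matrix and $\tilde{\mathbf{y}} = \mathbf{Q}^{T}\mathbf{y}$, the squared partial Euclidean distance (PED) of the partial symbol vector $\mathbf{x}_i^{M_T} = [x_i, \ldots, x_{M_T}]^{T}$ can be calculated as [7]

$$d(\mathbf{x}_i^{M_T}) = d(\mathbf{x}_{i+1}^{M_T}) + \left| b_{i+1}(\mathbf{x}_{i+1}^{M_T}) - R_{i,i}\, x_i \right|^2 \tag{3}$$

where $d(\mathbf{x}_{M_T+1}^{M_T}) = 0$, $b_{i+1}(\mathbf{x}_{i+1}^{M_T}) = \tilde{y}_i - \sum_{j=i+1}^{M_T} R_{i,j}\, x_j$, $R_{i,j}$ is the $(i,j)$th term of $\mathbf{R}$, and $i = M_T, \ldots, 1$. The LSD output candidate list is then used to approximate the soft outputs [2]. A computationally efficient max-log-MAP approximation of the log-likelihood ratio (LLR) $L_D(b_k)$ of the $k$th transmitted bit $b_k$ is calculated from the candidate list as in [2], where the sets $\{\mathbf{x} \mid b_k = \pm 1\}$ include the bit vectors in the candidate set $\mathcal{L}$ for each bit value.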
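For orientation, the max-log-MAP LLR over a candidate list is commonly written in the form below, following the list-based approach of [2]. This is a sketch rather than a reproduction of the expression used by the authors: the noise variance $\sigma^2$, the $\pm 1$ bit labeling, the set names $\mathcal{X}_{k,\pm 1}$, and the omission of a priori information are assumptions.

$$L_D(b_k) \approx \frac{1}{2} \max_{\mathbf{x} \in \mathcal{L} \cap \mathcal{X}_{k,+1}} \left\{ -\frac{1}{\sigma^2} \lVert \mathbf{y} - \mathbf{H}\mathbf{x} \rVert^2 \right\} - \frac{1}{2} \max_{\mathbf{x} \in \mathcal{L} \cap \mathcal{X}_{k,-1}} \left\{ -\frac{1}{\sigma^2} \lVert \mathbf{y} - \mathbf{H}\mathbf{x} \rVert^2 \right\}$$

Here $\mathcal{X}_{k,\pm 1}$ denotes the set of symbol vectors whose $k$th bit equals $\pm 1$. Each maximization runs only over the candidates stored in $\mathcal{L}$, so the quality of the approximation depends on the list containing good candidates for both bit values.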
We propose an MMF-LSD algorithm, which is a modification of the increasing radius (IR)-LSD in [10]. We include a maximum limit $D_{\max}$ for the algorithm iterations to fix the variable search complexity and to make the algorithm more suitable for implementation. The algorithm uses two memory sets: the final candidate memory $\mathcal{L}$ with the size of $N_{\mathrm{cand}}$ candidates and the partial candidate memory $\mathcal{S}$ with the size of $D_{\max}$ candidates. The MF search requires that the partial candidates are stored in $\mathcal{S}$ and that the candidate with the minimum PED is extended on each iteration. We also propose a novel memory sphere radius $C_{\mathrm{mem}}$ to decrease the number of stored candidates and the complexity of the required minimum search. The extended partial candidates $\mathcal{N}$ are compared to $C_{\mathrm{mem}}$ and stored in the memory $\mathcal{S}$ only if $d(\mathbf{s}) < C_{\mathrm{mem}}$. We define $C_{\mathrm{mem}}$ based on the minimum Euclidean distance (ED) $\min_{\mathbf{x} \in \mathcal{L}} d(\mathbf{x})$ of the previously solved candidate(s) in the final list(s), which is then scaled with a radius scaling variable $W_R$ so that only the potential partial candidates are stored in the partial memory set $\mathcal{S}$. The minimum ED values can be averaged over time and frequency, i.e., over OFDM subcarriers, and the memory sphere radius can then be written as

$$C_{\mathrm{mem}} = W_R \cdot \overline{\min_{\mathbf{x} \in \mathcal{L}} d(\mathbf{x})}$$

where the overline denotes the average over time and frequency.
The impact of Cmem on complexity and performance will be illustrated with numerical examples in Section IV.
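The search described in this section can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the authors' hardware design: the function names `make_node` and `mmf_lsd`, the node tuple layout, the child/father extension rule (best symbol on the next layer plus next-best sibling on the current layer), and the stopping test against the worst final candidate are all choices made for this example.

```python
import heapq
from math import inf


def make_node(R, y_t, alphabet, level, upper, parent_ped, rank):
    """Build the rank-th best node on layer `level` below the symbols `upper`.

    A node is the tuple (ped, level, symbols, parent_ped, rank), where
    `symbols` covers layers level..M-1. Returns None if no sibling is left.
    """
    if rank >= len(alphabet):
        return None
    # b = y_t[level] - sum_{j > level} R[level][j] * x_j
    b = y_t[level] - sum(R[level][level + 1 + k] * x for k, x in enumerate(upper))
    # Order the symbols by their distance increment (no division needed).
    ordered = sorted(alphabet, key=lambda x: (b - R[level][level] * x) ** 2)
    x = ordered[rank]
    ped = parent_ped + (b - R[level][level] * x) ** 2
    return (ped, level, (x,) + upper, parent_ped, rank)


def mmf_lsd(R, y_t, alphabet, n_cand, d_max, c_mem=inf):
    """Metric-first list search limited to d_max iterations.

    R is upper triangular (list of rows) and y_t is the rotated received
    vector. Returns the final list as sorted (ed, symbols) pairs.
    """
    m = len(y_t)
    partial = []  # min-heap S of stored partial candidates
    final = []    # max-heap L, stored as (-ed, symbols)
    current = make_node(R, y_t, alphabet, m - 1, (), 0.0, 0)
    for _ in range(d_max):
        if current is None:
            break
        ped, level, syms, parent_ped, rank = current
        list_full = len(final) == n_cand
        if list_full and ped >= -final[0][0]:
            break  # no open node can improve a full final list
        extended = []
        if level == 0:
            # Complete candidate: store it in the final list.
            if not list_full:
                heapq.heappush(final, (-ped, syms))
            else:
                heapq.heapreplace(final, (-ped, syms))
        else:
            # Child: best symbol on the next layer.
            extended.append(make_node(R, y_t, alphabet, level - 1, syms, ped, 0))
        # Father: next-best sibling on the current layer.
        father = make_node(R, y_t, alphabet, level, syms[1:], parent_ped, rank + 1)
        if father is not None:
            extended.append(father)
        # Keep only candidates inside the memory sphere radius.
        extended = sorted(n for n in extended if n[0] < c_mem)
        if extended and (not partial or extended[0] <= partial[0]):
            current, rest = extended[0], extended[1:]
        elif partial:
            current, rest = heapq.heappop(partial), extended
        else:
            current, rest = None, []
        for node in rest:
            heapq.heappush(partial, node)
    return sorted((-neg_ed, syms) for neg_ed, syms in final)
```

In this sketch the partial heap grows by at most one node per iteration, so a memory of `d_max` entries is sufficient, and a smaller `c_mem` trades list completeness for fewer stored candidates.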
III. ARCHITECTURE DESIGN

A. MMF-LSD Algorithm
The MMF-LSD algorithm architecture, which operates in a sequential fashion, includes a tree pruning unit (TPU), a partial candidate memory unit, a final candidate memory unit, and a control logic (CNTR) unit, as illustrated in Fig. 1.
1) TPU Unit:
The TPU has two similar candidate extension modules, which execute the tree pruning for two nodes in parallel and can be divided into two sub-units as illustrated in Fig. 2. The enumeration is designed in a modified fashion from that presented in [3, (14)]. Instead of calculating the costly and high-latency division operation, we calculate the absolute value in (3) with the $|\Omega_R|$ different symbols $x_i$. The degree of parallelism should be chosen according to the slowest parallel unit in the whole MMF-LSD algorithm architecture to optimize the performance.
2) Memory Units: The memory units are designed as binary heap [11] data structures, which keep the stored elements in order according to the selected cost metric. The partial candidate memory set $\mathcal{S}$ is implemented as a min-heap, where the elements $\mathcal{N}(\mathbf{s}, d(\mathbf{s}), n_2, i)$ are ordered so that the candidate with the minimum PED is always at the top of the heap, and the final memory set $\mathcal{L}_F$ is implemented as a max-heap. Storing a new element has a worst-case time complexity of $O(\log_2 k)$ [11], where $k$ is the size of the memory. The size of the partial candidate memory $\mathcal{S}$ is equal to $D_{\max}$ elements, since in the worst case one iteration removes the minimum candidate and adds two candidates, i.e., the heap grows by at most one element per iteration. We modified the traditional heap sorting logic to limit unnecessary memory access. The possible new partial candidate(s) (child and father) are first compared to the minimum candidate. If the candidate at the top of the heap has the minimum PED, the first candidate to be stored is placed at the top of the heap and sorted via the down-heap operation [11]. Otherwise, the new candidate is added to the next free memory address and the heap is sorted via the up-heap operation [11]. We also apply the memory sphere radius $C_{\mathrm{mem}}$ to decrease the amount of memory access, as the new candidates are discarded if $d(\mathbf{s}) \geq C_{\mathrm{mem}}$. The partial memory unit microarchitecture with up- and down-heap logic is illustrated in Fig. 3.
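The modified storing logic can be illustrated with Python's standard `heapq` module, used here as a software stand-in for the hardware heap. The function name `store_candidates` and the flag `top_selected` are hypothetical; the sketch assumes candidates are `(ped, label)` tuples and that the caller already knows whether the heap top was chosen for extension.

```python
import heapq


def store_candidates(heap, new_cands, top_selected, c_mem):
    """Update the min-heap `heap` with new partial candidates.

    top_selected is True when the heap top was chosen for extension and
    therefore has to leave the heap. Returns the removed top, or None.
    """
    # Memory sphere radius check: drop candidates with PED >= c_mem.
    kept = [c for c in new_cands if c[0] < c_mem]
    removed = None
    if top_selected and heap:
        if kept:
            # Overwrite the top with the first new candidate and restore
            # the heap order with a single down-heap pass.
            removed = heapq.heapreplace(heap, kept.pop(0))
        else:
            removed = heapq.heappop(heap)
    for cand in kept:
        # Append at the next free position and restore order via up-heap.
        heapq.heappush(heap, cand)
    return removed
```

Replacing the top in one step avoids a separate removal followed by an insertion, which is the kind of saving in memory access that the modified sorting logic aims for.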
3) CNTR Unit and Data Flow: The control logic unit includes an iteration counter for the MMF-LSD algorithm and determines which candidates are stored in the memory and which candidate is used in the search in the next algorithm iteration. The candidate passed to the TPU in the next iteration is the one with the minimum PED among the extended candidates $\mathcal{N}_c$ and $\mathcal{N}_f$ and the minimum candidate $\mathcal{S}_0$ in the partial memory. If either of the extended candidates $\mathcal{N}_c$ or $\mathcal{N}_f$ is selected for the next algorithm iteration, $\mathcal{S}_0$ remains in the memory. Thus, unnecessary memory access is minimized, as the candidates $\mathcal{N}_c$ and $\mathcal{N}_f$ are not directly stored in the memory. The extended partial candidate(s) to be stored in $\mathcal{S}$ are also checked against $C_{\mathrm{mem}}$ to minimize memory access. The data flow is designed to minimize the latency of one algorithm iteration by introducing parallel operations. The straightforward data flow would first extend the new candidates, then store them in the memory units, and finally determine the new candidate for the next iteration. However, the data flow can be designed more efficiently as follows: the control logic unit determines the new candidate for the TPU for iteration $D = 2$ and the candidates to be stored in the memory units from iteration $D = 1$, so that the TPU and the memory units can then be executed in parallel, which decreases the latency significantly compared to the straightforward mapping.
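The latency benefit of the parallel data flow can be summarized with a simple per-iteration model. The symbols $T_{\mathrm{TPU}}$, $T_{\mathrm{mem}}$, and $T_{\mathrm{CNTR}}$ (latency of the TPU, the memory units, and the control logic) are hypothetical names introduced for this illustration, and the model assumes that the control logic does not overlap with the other units.

$$T_{\mathrm{seq}} = T_{\mathrm{TPU}} + T_{\mathrm{mem}} + T_{\mathrm{CNTR}}, \qquad T_{\mathrm{par}} = \max\left(T_{\mathrm{TPU}}, T_{\mathrm{mem}}\right) + T_{\mathrm{CNTR}}$$

Under this model the parallel mapping saves $\min(T_{\mathrm{TPU}}, T_{\mathrm{mem}})$ per iteration, and the saving is largest when the TPU and memory latencies are matched.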
B. LLR Calculation Unit
The soft output information $L_D(b_k)$ is calculated from the MMF-LSD algorithm output list $\mathcal{L}$ by using the max-log-MAP approximation as in (4). The microarchitecture can be divided into two main parts: the scaling of the ED values and the search for the maximum values for both bit counterparts. Thus, two sequential logic loops are required in the calculation, one over the final list index $m$ and one over the bit index $k$. The latency of the loops can be decreased by applying parallel logic and/or pipelining to check multiple ED values or bits in parallel. It should be noted that the possibility of a parallel implementation of the logic is a clear benefit of the max-log-MAP approximation compared to the log-MAP algorithm. The inaccuracy of the approximation can be compensated for by limiting the dynamic range of the output LLR variable [2].
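The two loops and the LLR clipping can be sketched as below. This is an illustrative software model, not the implemented microarchitecture: the function name `max_log_llr`, the $\pm 1$ bit labeling, the scaling by a noise variance `sigma2`, and the rule of assigning the clipping value when one bit hypothesis is missing from the list are assumptions.

```python
def max_log_llr(candidates, sigma2, n_bits, clip):
    """Max-log LLRs from a candidate list of (ed, bits) pairs.

    bits is a tuple of +1/-1 values; the output is limited to [-clip, clip].
    """
    llrs = []
    for k in range(n_bits):                  # loop over the bit index k
        best = {+1: None, -1: None}
        for ed, bits in candidates:          # loop over the list index m
            metric = -ed / sigma2            # scaling of the ED value
            if best[bits[k]] is None or metric > best[bits[k]]:
                best[bits[k]] = metric       # maximum for this bit value
        if best[+1] is None:
            llr = -clip                      # no candidate with b_k = +1
        elif best[-1] is None:
            llr = clip                       # no candidate with b_k = -1
        else:
            llr = 0.5 * (best[+1] - best[-1])
        llrs.append(max(-clip, min(clip, llr)))
    return llrs
```

For example, `max_log_llr([(0.0, (1, -1)), (4.0, (-1, -1)), (5.0, (-1, 1))], 1.0, 2, 8.0)` returns `[2.0, -2.5]`. The inner loop has no dependency between candidates other than the running maximum, which is what makes a parallel or pipelined comparison structure straightforward.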
C. Scalability
The MMF-LSD algorithm architecture can be used as such in SM systems with different antenna configurations and constellation sizes $|\Omega|$ as long as $N_R \geq N_T$. If some diversity method combined with the SM scheme is applied with $N_T > N_R$, the receiver signal model should be modified accordingly, e.g., as in [12]. A proper value for the iteration limit $D_{\max}$ depends on the channel realization and on the search tree size, i.e., on the number of independent data streams and the constellation size $|\Omega|$. A larger tree size requires a higher $D_{\max}$ value. Memory resources of $D_{\max}$ elements are reserved for the memory unit $\mathcal{S}$ according to the highest supported system configuration. The amount of parallelism and pipelining in the TPU unit can be modified based on latency requirements. However, for an efficient implementation, the TPU unit latency should be matched to that of the memory unit $\mathcal{S}$ and its logic, which are executed in parallel. The soft-output LLR calculation unit can be used as such for different system configurations, as it operates separately from the MMF-LSD algorithm. Multiple MMF-LSD algorithm units can be used in parallel to support higher data rate requirements. The scalability of the MMF-LSD is further illustrated with examples in Sections IV and V.
IV. IMPLEMENTATION TRADEOFFS
The MF algorithms as such are typically not suitable for low-cost implementation. Therefore, we apply the limited search variable $D_{\max}$ and the memory sphere radius $C_{\mathrm{mem}}$ in the MMF-LSD algorithm and LLR clipping in the LLR calculation to obtain a tradeoff between complexity and performance. It has also been shown that the detection order of the transmitted spatial streams affects the number of visited nodes [5]. Thus, we assume the use of sorted QRD (SQRD) [13] processing of the channel matrix $\mathbf{H}$ prior to the LSD algorithm, where the ordering of the spatial layers is included in a modified Gram-Schmidt decomposition process. The SQRD algorithm leads to a close to optimal detection order, so that the strongest signal is located at the top of the sphere search tree [13]. The SQRD of the channel matrix should be updated for all OFDM subcarriers within the channel coherence time. We do not focus on the SQRD implementation in more detail in this paper, but implementations of SQRD have been presented, e.g., in [14].
We performed computer simulations to verify the feasible tradeoffs between performance and complexity of the MMF-LSD. The results are presented as frame error ratio (FER) versus signal-to-noise ratio (SNR) in a Winner B1 channel [15] in Fig. 4. The performance of the MMF-LSD is also compared to that of the DF-LSD [2], linear minimum mean squared error (LMMSE), and ML detectors. We also studied the effect of the memory sphere radius $C_{\mathrm{mem}}$ on the MMF-LSD performance and determined a proper value for $W_R$ to be used. We studied the resulting complexity reduction (Table I). The average number of up- and down-heap operations is also decreased, which significantly reduces the required memory access and the latency of the heap sorting.
V. IMPLEMENTATION RESULTS
The soft-output MMF-LSD algorithm was implemented for an SM system. The implementation was targeted to a 0.18-μm CMOS ASIC technology using the ANSI C++ language. The Mentor Graphics Catapult C Synthesis tool was applied to produce the RTL description. The synthesis was done with Synopsys Design Compiler for a 250-MHz clock frequency, and the power usage was estimated with the Synopsys PrimePower tool. The detailed synthesis and power usage results are shown in Table II. The ASIC complexity is given in gate equivalents (GEs), where one GE corresponds to the area of a two-input drive-one NAND gate. The TPU unit is the most complex unit, while the others require only a minor part of the total resources. The TPU unit for 64-QAM is implemented with 2 and 4 parallel and pipelined multipliers (MULs) in the subunits, while the unit for 16-QAM is implemented with 2 and 2 MULs; the additional MULs speed up the more demanding processing caused by the larger constellation and longer word lengths. The partial memory size is $D_{\max}$ candidate words, and it is implemented with dual-port memory to enhance the performance. The total power usage is $P = 56.5$ mW and $P = 90.3$ mW for the MMF-LSD algorithm with 16- and 64-QAM, respectively. The latency of an MMF-LSD algorithm iteration is determined by the slowest parallel unit with the average number of required operations, the TPU unit, and the control logic. We also implemented a DF-LSD algorithm [2] to have a fair comparison to the MMF-LSD algorithm. The complexity of the MMF-LSD algorithm in GEs is more than double that of the DF-LSD algorithm, mainly due to the larger TPU unit, which processes two nodes in one iteration. The more sophisticated search of the MMF-LSD algorithm requires somewhat more complex control logic and a partial memory unit, but it also requires far fewer algorithm iterations on average. The LLR calculation unit is implemented with two MUL units and $N_{\mathrm{cand}} = 15$ pipelined parallel comparison units, and the results are shown in Table II.
The LLR calculation unit is used separately after the MMF-LSD algorithm search is executed, and, thus, the latency is different.
The detection rate $R_{\mathrm{det}}$ depends on the number of iterations $D$, which should be selected to meet the desired FER target with the given channel and SNR. The detection rate of the MMF-LSD algorithm is listed and compared to that of the DF-LSD algorithm implementation at low SNR = 15/21 dB and at high SNR = 21/26 dB in Table III. The detection rate $R_{\mathrm{det}}$ of the MMF-LSD algorithm at low SNR is approximately 4-6 times that of the DF-LSD algorithm, with approximately 2-3 times the complexity. The detection rate at high SNR is approximately the same, because both detectors compute sufficiently good LLR approximations with only a minimum number of iterations. In practice, the low SNR scenario is more important, because the receiver has to be designed according to the worst case. The impact of $D_{\max}$ on $R_{\mathrm{det}}$ and FER can be derived from the results. The LLR calculation unit achieves a fixed rate of $R_{\mathrm{det}}^{(\mathrm{asic})} = 121.2/146.3$ Mbps for 16-/64-QAM, respectively, with a complexity of 18.5 kGE. In an OFDM system, one LLR calculation unit can serve multiple parallel LSD algorithm units. The delay of the detection is the combined latency of the MMF-LSD algorithm and the LLR calculation unit for one subcarrier.
A. Comparison to Literature and Discussion
We compare the implementation detection rates at high SNR to other sphere detector implementations for 4 × 4 systems with 16- and 64-QAM constellations in Table IV. It can be seen that our implementation is competitive with the other implementations. In particular, the complexity and power usage are lower than those of the K-best implementations. It can also be noted that the detection rate of our implementation is competitive with the other presented designs even though the main advantage of the MMF-LSD is at low SNR. The main limiting factor of our implementation for higher throughput is the latency of one algorithm iteration, in particular that of the TPU unit, which could be reduced, e.g., by using a more advanced ASIC technology.
VI. CONCLUSION
We considered detection based on the MF search strategy and introduced a soft-output MMF-LSD algorithm, which is more suitable for implementation. We introduced an architecture for the algorithm and showed that the main operations of the algorithm can be run in parallel and that they scale to different system configurations with minor changes. We also introduced the memory sphere radius, which reduces the memory access requirements and decreases the average number of visited nodes. An optimized, scalable architecture was implemented for an SM system with up to four transmitted streams and 16- and 64-QAM constellations on a 0.18-μm CMOS ASIC technology. The results show that the MMF-LSD algorithm is more efficient than the DF-LSD at low SNR.
