Abstract: Multiple-input multiple-output (MIMO) detection algorithms providing soft information for a subsequent channel decoder pose significant implementation challenges due to their high computational complexity. In this paper, we show how sphere decoding can be used as an efficient tool to implement soft-output MIMO detection with flexible trade-offs between computational complexity and (error rate) performance. In particular, we provide VLSI implementation results which demonstrate that single tree-search, sorted QR-decomposition, channel matrix regularization, log-likelihood ratio clipping, and imposing run-time constraints are the key ingredients for realizing soft-output MIMO detectors with near max-log performance at a chip area that is only 58% higher than that of the best-known hard-output sphere decoder VLSI implementation.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) wireless systems employ multiple antennas on both sides of the wireless link and offer increased spectral efficiency (compared to single-antenna systems) by transmitting multiple data streams concurrently and in the same frequency band (spatial multiplexing). MIMO technology constitutes the basis for upcoming wireless communication standards, such as IEEE 802.11n and IEEE 802.16e.
The main challenge in the practical realization of MIMO wireless systems lies in the efficient implementation of the detector which needs to separate the spatially multiplexed data streams. To this end, a wide range of algorithms offering various trade-offs between performance and computational complexity have been developed [2]. Linear detection producing hard outputs constitutes one extreme of the complexity/performance trade-off region, while computationally demanding maximum-likelihood (ML) detection algorithms in combination with exact a posteriori probability (APP) computation result in the opposite extreme. In general, the computational complexity of a MIMO detection algorithm depends on the symbol constellation size and the number of spatially multiplexed data streams, but often also on the instantaneous MIMO channel realization and the signal-to-noise ratio (SNR).
Manuscript received April 7, 2007; revised October 31, 2007. This paper was presented in part in [1].
On the other hand, the overall decoding effort is typically constrained by the system bandwidth, latency requirements, and the quest to keep chip area and power consumption as low as possible. Implementing different algorithms, each optimized for a maximum allowed decoding effort and/or a particular system configuration, would entail considerable chip area overhead and in addition be highly inefficient since large portions of the chip would remain idle most of the time. A practical MIMO receiver design should therefore be able to cover a wide range of complexity/performance trade-offs using a single tunable detection algorithm.
Contributions:
In this paper, we describe a tunable MIMO detector based on the sphere decoder [3]-[8], with performance ranging from that of hard-output successive interference cancellation (SIC) [9] to that of max-log APP detection [10]. Tuning of the detector is achieved through log-likelihood ratio (LLR) clipping, channel matrix regularization, and imposing constraints on the maximum computational complexity of the decoder (i.e., run-time constraints). With a view towards VLSI implementation, we elaborate on, and provide refinements of, the tree-search algorithm outlined in [11] leading to what we term the single tree-search (STS) approach. We describe how LLR clipping as proposed in [12] can be incorporated into the STS algorithm. A framework for systematically characterizing the complexity/performance trade-offs of the resulting class of soft-output sphere decoders is formulated. Finally, we present a suitable VLSI architecture and provide reference implementation results for max-log soft-output sphere decoding with LLR clipping.
Notation: Matrices are set in boldface capital letters, vectors in boldface lowercase letters. The superscripts T and H stand for transpose and conjugate transposition, respectively.
Outline of the Paper: The remainder of this paper is organized as follows. Section II reviews the transformation of the max-log soft-output MIMO detection problem into a series of tree-search problems. In Section III, we review the repeated tree-search (RTS) algorithm proposed in [13] and introduce the STS algorithm. In Section IV, we describe methods for reducing the tree-search complexity both in the RTS and the STS algorithms. A framework for evaluating the complexity/performance trade-offs of the resulting family of soft-output sphere decoders is introduced in Section V. In Section VI, we describe a VLSI architecture for the efficient implementation of max-log soft-output sphere decoding with LLR clipping. Corresponding ASIC implementation results are summarized in Section VII. We conclude in Section VIII.
0733-8716/08/$25.00 © 2008 IEEE
II. SOFT-OUTPUT SPHERE DECODING
Consider a MIMO system with $M_T$ transmit and $M_R \geq M_T$ receive antennas. The coded bit-stream is mapped to (a sequence of) $M_T$-dimensional transmit symbol vectors $\mathbf{s} \in \mathcal{O}^{M_T}$, where $\mathcal{O}$ stands for the set of underlying complex-valued scalar constellation points with $|\mathcal{O}| = 2^Q$. Each symbol vector $\mathbf{s}$ is associated with a bit-level label vector $\mathbf{x}$ where, throughout the paper, symbol vectors and their associated labels will be used interchangeably. Slightly deviating from our notation rules, we denote the entries of $\mathbf{x}$ as $x_{j,b}$, where the indices $j$ and $b$ refer to the $b$th bit in the label of the constellation point corresponding to the $j$th entry of $\mathbf{s} = [s_1\ s_2\ \cdots\ s_{M_T}]^T$. The resulting complex baseband input-output relation is given by
$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} \tag{1}$$
where $\mathbf{H}$ denotes the $M_R \times M_T$ channel matrix and $\mathbf{n}$ is an i.i.d. zero-mean proper complex Gaussian distributed $M_R$-dimensional noise vector with variance $N_o$ per complex entry. Throughout this paper, we assume that the receiver has perfect knowledge of the channel matrix realization. The SNR per receive antenna is $1/N_o$.
A. Computation of the Max-Log LLRs
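Soft-output MIMO detection requires the computation of LLRs, denoted as $L(\cdot)$, for all bits in the label $\mathbf{x}$. In order to reduce the corresponding computational complexity, we employ the max-log approximation [10], [14]
$$L(x_{j,b}) = \min_{\mathbf{s} \in \mathcal{X}_{j,b}^{(0)}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 - \min_{\mathbf{s} \in \mathcal{X}_{j,b}^{(1)}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 \tag{2}$$
where $\mathcal{X}_{j,b}^{(0)}$ and $\mathcal{X}_{j,b}^{(1)}$ are the sets of symbol vectors that have the $b$th bit in the label of the $j$th scalar symbol equal to 0 and 1, respectively. Note that we do not take into account a priori information. The max-log approximation entails a performance loss compared to using the exact LLRs. For the simulation setup considered in Section V, this loss, in terms of SNR, is found to be around 0.25 dB over a large range of SNRs. We furthermore emphasize that the LLRs in (2) are normalized by the noise variance $N_o$ in order to remove the factor $1/N_o$ on the right hand side (RHS) of (2). This simplifies the exposition and does not degrade the error rate performance of the max-log-based soft-input Viterbi decoder considered in Section V.
For each bit, one of the two minima in (2) is given by the metric $\lambda^{\mathrm{ML}} = \|\mathbf{y} - \mathbf{H}\mathbf{s}^{\mathrm{ML}}\|^2$ associated with the ML solution of the MIMO detection problem
$$\mathbf{s}^{\mathrm{ML}} = \arg\min_{\mathbf{s} \in \mathcal{O}^{M_T}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2. \tag{3}$$
The other minimum in (2) can be written as
$$\lambda_{j,b}^{\overline{\mathrm{ML}}} = \min_{\mathbf{s} \in \mathcal{X}_{j,b}^{(\overline{x_{j,b}^{\mathrm{ML}}})}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 \tag{4}$$
where the counter-hypothesis $\overline{x_{j,b}^{\mathrm{ML}}}$ denotes the binary complement of the $b$th bit in the label of the $j$th entry of $\mathbf{s}^{\mathrm{ML}}$. With (3) and (4) the max-log LLRs can be written as
$$L(x_{j,b}) = \begin{cases} \lambda^{\mathrm{ML}} - \lambda_{j,b}^{\overline{\mathrm{ML}}}, & x_{j,b}^{\mathrm{ML}} = 0 \\ \lambda_{j,b}^{\overline{\mathrm{ML}}} - \lambda^{\mathrm{ML}}, & x_{j,b}^{\mathrm{ML}} = 1. \end{cases} \tag{5}$$
From (5) we can conclude that efficient max-log APP MIMO detection reduces to efficiently identifying $\mathbf{s}^{\mathrm{ML}}$, $\lambda^{\mathrm{ML}}$, and $\lambda_{j,b}^{\overline{\mathrm{ML}}}$ for $j = 1, 2, \ldots, M_T$ and $b = 1, 2, \ldots, Q$ [13].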
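As a reference point for the tree-search methods that follow, the quantities in (2)-(5) can be evaluated by exhaustive enumeration. The sketch below is illustrative only: it assumes a toy BPSK mapping (the `BPSK` table, the function names, and the example channel are hypothetical), and it takes the LLR sign convention to be "minimum over bit 0 minus minimum over bit 1", normalized by $N_o$.

```python
from itertools import product

# Hypothetical toy constellation: BPSK with Q = 1 (bit 0 -> -1, bit 1 -> +1).
BPSK = {(0,): -1.0, (1,): 1.0}


def dist2(y, H, s):
    """Squared Euclidean distance ||y - H s||^2 for H given as a list of rows."""
    return sum(abs(y_r - sum(h * s_c for h, s_c in zip(row, s))) ** 2
               for y_r, row in zip(y, H))


def maxlog_llrs(y, H, constellation=BPSK):
    """Exhaustive search returning (x_ML, lambda_ML, LLRs)."""
    m_t = len(H[0])
    labels = list(constellation)
    q = len(labels[0])
    lam_ml, x_ml = float("inf"), None
    # best[j][b][v]: smallest metric over all vectors whose bit (j, b) equals v.
    best = [[[float("inf"), float("inf")] for _ in range(q)] for _ in range(m_t)]
    for x in product(labels, repeat=m_t):
        d = dist2(y, H, [constellation[label] for label in x])
        if d < lam_ml:
            lam_ml, x_ml = d, x
        for j in range(m_t):
            for b in range(q):
                v = x[j][b]
                best[j][b][v] = min(best[j][b][v], d)
    # Normalized max-log LLR: min over bit = 0 minus min over bit = 1.
    llrs = [[best[j][b][0] - best[j][b][1] for b in range(q)] for j in range(m_t)]
    return x_ml, lam_ml, llrs


# Two streams over an identity channel (illustrative values).
x_ml, lam_ml, llrs = maxlog_llrs([0.9, -0.2], [[1.0, 0.0], [0.0, 1.0]])
```

The cost of this enumeration grows as $2^{QM_T}$, which is what the tree-search formulation is meant to avoid.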
B. Max-Log APP MIMO Detection as a Tree Search
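Transforming (3) and (4) into tree-search problems and using the sphere decoding algorithm [3]-[8] allows the LLRs in (5) to be computed efficiently. To this end, the channel matrix $\mathbf{H}$ is first QR-decomposed according to $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where the $M_R \times M_T$ matrix $\mathbf{Q}$ is unitary, and the $M_T \times M_T$ upper-triangular matrix $\mathbf{R}$ has real-valued positive entries on its main diagonal. Left-multiplying (1) by $\mathbf{Q}^H$ leads to the modified input-output relation $\tilde{\mathbf{y}} = \mathbf{R}\mathbf{s} + \mathbf{Q}^H\mathbf{n}$ with $\tilde{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$ and hence, noting that $\mathbf{Q}^H\mathbf{n}$ has the same statistics as $\mathbf{n}$, to the equivalent characterization of $\lambda^{\mathrm{ML}}$ and $\lambda_{j,b}^{\overline{\mathrm{ML}}}$ as
$$\lambda^{\mathrm{ML}} = \min_{\mathbf{s} \in \mathcal{O}^{M_T}} \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2 \tag{6}$$
$$\lambda_{j,b}^{\overline{\mathrm{ML}}} = \min_{\mathbf{s} \in \mathcal{X}_{j,b}^{(\overline{x_{j,b}^{\mathrm{ML}}})}} \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2. \tag{7}$$
We next define the partial symbol vectors (PSVs) $\mathbf{s}^{(i)} = [s_i\ s_{i+1}\ \cdots\ s_{M_T}]^T$ and note that they can be arranged in a tree that has its root just above level $i = M_T$ and leaves, on level $i = 1$, which correspond to symbol vectors $\mathbf{s}$. In the following, the label associated with $\mathbf{s}^{(i)}$ is denoted by $\mathbf{x}^{(i)}$. The distances $d(\mathbf{s}) = \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2$ in (6) and (7) can be computed recursively as $d(\mathbf{s}) = d_1$ with the partial Euclidean distances (PEDs)
$$d_i = d_{i+1} + |e_i|^2, \quad i = M_T, M_T - 1, \ldots, 1,$$
the initialization $d_{M_T+1} = 0$, and the distance increments (DIs)
$$|e_i|^2 = \Big|\tilde{y}_i - \sum_{j=i}^{M_T} R_{i,j}\, s_j\Big|^2.$$
Since the dependence of the PED $d_i$ on the symbol vector $\mathbf{s}$ is only through the PSV $\mathbf{s}^{(i)}$, we have transformed ML detection and the computation of the max-log LLRs into a weighted tree-search problem: PSVs and PEDs are associated with nodes, branches correspond to DIs. For brevity, we shall often say "the node $\mathbf{s}^{(i)}$" to refer to the node corresponding to the PSV $\mathbf{s}^{(i)}$. Each path from the root down to a leaf corresponds to a symbol vector $\mathbf{s} \in \mathcal{O}^{M_T}$. The solution of (6) and (7) corresponds to the leaf associated with the smallest metric in $\mathcal{O}^{M_T}$ and $\mathcal{X}_{j,b}^{(\overline{x_{j,b}^{\mathrm{ML}}})}$, respectively.
The basic building block underlying the two tree traversal strategies described in the next section is the Schnorr-Euchner (SE) sphere decoder (SESD) with radius reduction [15], [16], briefly summarized as follows: The search in the tree is constrained to nodes which lie within a radius $r$ around $\tilde{\mathbf{y}}$ and tree traversal is performed depth-first, visiting the children of a given node in ascending order of their PEDs. The basic idea of radius reduction is to start the algorithm with $r = \infty$ and to update the radius according to $r^2 \leftarrow d(\mathbf{s})$ whenever a leaf $\mathbf{s}$ has been reached. This avoids the problem of choosing a suitable initial radius and still leads to efficient pruning of the tree.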
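The depth-first search with radius reduction summarized above can be sketched in a few lines. This is a minimal illustration rather than the decoder of [15], [16]: the function name `sesd` and the example values are hypothetical, the children are ordered by explicitly sorting their PEDs (a hardware implementation would use an enumeration scheme instead), and the node count follows the complexity measure of counting every visited node except the root.

```python
def sesd(y_t, R, points):
    """Depth-first tree search with Schnorr-Euchner ordering and radius reduction.

    y_t: rotated receive vector, R: upper-triangular matrix as a list of rows,
    points: scalar constellation points (real or complex).
    Returns (smallest metric, corresponding symbol vector, visited nodes).
    """
    m_t = len(R)
    state = {"r2": float("inf"), "s": None, "visited": 0}

    def descend(i, s, d_parent):
        # Interference-cancelled observation for level i (streams i+1.. are fixed).
        b_i = y_t[i] - sum(R[i][j] * s[j] for j in range(i + 1, m_t))
        children = sorted(((d_parent + abs(b_i - R[i][i] * p) ** 2, p) for p in points),
                          key=lambda child: child[0])
        for d_i, p in children:
            if d_i >= state["r2"]:
                break  # remaining siblings have even larger PEDs
            state["visited"] += 1
            s[i] = p
            if i == 0:
                # Leaf reached: shrink the radius to the metric of this leaf.
                state["r2"], state["s"] = d_i, list(s)
            else:
                descend(i - 1, s, d_i)

    descend(m_t - 1, [0] * m_t, 0.0)
    return state["r2"], state["s"], state["visited"]


# Illustrative 2x2 real-valued example with BPSK points.
metric, s_hat, visited = sesd([0.2, -0.7], [[1.0, 0.5], [0.0, 1.0]], [-1, 1])
```

In this example the first leaf reached is already the ML solution, so only $M_T = 2$ nodes are visited.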
III. TREE-TRAVERSAL STRATEGIES
Computing the LLRs in (5) requires determining the metrics $\lambda^{\mathrm{ML}}$ and $\lambda_{j,b}^{\overline{\mathrm{ML}}}$. Since this computation has to be carried out for every bit, it is immediately obvious that LLR computation results in an order of magnitude increase in computational complexity compared to hard-output sphere decoding. The situation is further exacerbated by the fact that forcing the SESD into subtrees when computing the minima in (7) leads to significantly less efficient tree pruning behavior, which finally results in an overall complexity increase (over hard-output SESD) of two orders of magnitude. The STS algorithm introduced below is key in reducing this computational complexity.
In the following, we discuss two tree traversal strategies for solving (6) and (7). The first approach described below was introduced in [13] and will be referred to as repeated tree search (RTS). The second algorithm builds on a tree traversal strategy outlined in [11] . With a view towards VLSI implementation, we propose refinements of the approach in [11] resulting in what we call the single tree-search (STS) strategy.
A. Repeated Tree Search (RTS)
The basic idea of the RTS algorithm described in [13] is to start by solving (6) (using the SESD) and to then rerun the SESD to solve (7) for each bit (i.e., $QM_T$ times) in the symbol vector. When rerunning the SESD to determine $\lambda_{j,b}^{\overline{\mathrm{ML}}}$ in (7), the search tree is prepruned by forcing the decoder to exclude all nodes from the search for which $x_{j,b} = x_{j,b}^{\mathrm{ML}}$. For BPSK, this prepruning procedure is illustrated in Fig. 1. Following the proposal in [13] and initializing the SESD with $r = \infty$ in each of the $QM_T$ runs required to obtain $\lambda_{j,b}^{\overline{\mathrm{ML}}}$ will lead to significant computational complexity. It is therefore important to realize that (without compromising max-log optimality) the search radius $r_{j,b}$ can be initialized by setting it equal to the minimum value of $\|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|$ over all $\mathbf{s} \in \mathcal{X}_{j,b}^{(\overline{x_{j,b}^{\mathrm{ML}}})}$ found during preceding tree traversals.
The main disadvantages of the RTS are: i) the repeated traversal of large parts of the tree which entails a large number of redundant computations; ii) significantly less efficient pruning behavior when computing the $\lambda_{j,b}^{\overline{\mathrm{ML}}}$ caused by the need to minimize over $\mathcal{X}_{j,b}^{(\overline{x_{j,b}^{\mathrm{ML}}})}$. The underlying reason is that pruning efficiency decreases significantly when forcing the sphere decoder through specific branches at levels further down the tree. As noted in [17], the problem in ii) can partly be mitigated by changing the detection order in each run. The resulting need for multiple QR decompositions, however, leads to an overall increase in terms of hardware complexity.
B. Single Tree Search
The key to more efficient (compared to RTS) tree traversal is to ensure that every node in the tree is visited at most once. This can be accomplished by searching for the ML solution and all counter-hypotheses concurrently. The basic idea behind such an approach has been outlined in [11]. In the following, with a view towards VLSI implementation, we provide refinements of the idea in [11]. Specifically, we formulate update rules and a pruning criterion based on a list containing the metric $\lambda^{\mathrm{ML}}$, the corresponding label $\mathbf{x}^{\mathrm{ML}}$, and the metrics $\lambda_{j,b}^{\overline{\mathrm{ML}}}$. The main idea is to search the subtree originating from a given node only if the result can lead to an update of at least one of the metrics in the list, i.e., either $\lambda^{\mathrm{ML}}$ or one of the $\lambda_{j,b}^{\overline{\mathrm{ML}}}$. In the ensuing discussion, the current ML hypothesis and the corresponding metric are denoted by $\mathbf{x}^{\mathrm{ML}}$ and $\lambda^{\mathrm{ML}}$, respectively.
1) List Administration:
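The algorithm is initialized with $\lambda^{\mathrm{ML}} = \lambda_{j,b}^{\overline{\mathrm{ML}}} = \infty$ ($\forall j, b$). Whenever a leaf with corresponding label $\mathbf{x}$ has been reached, the decoder distinguishes between two cases:
i) If a new ML hypothesis is found, i.e., $d(\mathbf{x}) < \lambda^{\mathrm{ML}}$, all $\lambda_{j,b}^{\overline{\mathrm{ML}}}$ for which $x_{j,b} = \overline{x_{j,b}^{\mathrm{ML}}}$ are set to $\lambda^{\mathrm{ML}}$, followed by the updates $\lambda^{\mathrm{ML}} \leftarrow d(\mathbf{x})$ and $\mathbf{x}^{\mathrm{ML}} \leftarrow \mathbf{x}$. In other words, for each bit in the ML hypothesis that is changed in the process of the update, the metric of the former ML hypothesis becomes the metric of the new counter-hypothesis. This procedure ensures that at all times $\lambda_{j,b}^{\overline{\mathrm{ML}}}$ is the metric associated with a valid counter-hypothesis to the current ML hypothesis.
ii) In the case where $d(\mathbf{x}) \geq \lambda^{\mathrm{ML}}$, only the counter-hypotheses have to be checked. For all $j$ and $b$ such that $x_{j,b} = \overline{x_{j,b}^{\mathrm{ML}}}$ and $d(\mathbf{x}) < \lambda_{j,b}^{\overline{\mathrm{ML}}}$, the decoder updates $\lambda_{j,b}^{\overline{\mathrm{ML}}} \leftarrow d(\mathbf{x})$.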
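The two list-administration cases can be illustrated as follows. The sketch assumes that a label is a tuple of per-stream bit tuples and that the list is kept in a dictionary; both representations, and the names `new_list` and `update_list`, are hypothetical choices made for illustration.

```python
INF = float("inf")


def new_list(m_t, q):
    """Initial list: lambda_ML and all counter-hypothesis metrics are infinite."""
    return {"x_ml": None, "lam_ml": INF,
            "lam_bar": [[INF] * q for _ in range(m_t)]}


def update_list(lst, x, d):
    """Update the list when a leaf with label x and metric d has been reached."""
    if d < lst["lam_ml"]:
        # Case i): new ML hypothesis. Every bit that flips inherits the metric
        # of the former ML hypothesis as its counter-hypothesis metric.
        if lst["x_ml"] is not None:
            for j, label in enumerate(x):
                for b, bit in enumerate(label):
                    if bit != lst["x_ml"][j][b]:
                        lst["lam_bar"][j][b] = lst["lam_ml"]
        lst["x_ml"], lst["lam_ml"] = x, d
    else:
        # Case ii): only the counter-hypotheses can improve.
        for j, label in enumerate(x):
            for b, bit in enumerate(label):
                if bit != lst["x_ml"][j][b] and d < lst["lam_bar"][j][b]:
                    lst["lam_bar"][j][b] = d
    return lst


# Four BPSK leaves (two streams) fed in an arbitrary order; values are illustrative.
leaves = [(((0,), (1,)), 5.05), (((0,), (0,)), 4.25),
          (((1,), (1,)), 1.45), (((1,), (0,)), 0.65)]
lst = new_list(m_t=2, q=1)
for label, metric in leaves:
    update_list(lst, label, metric)
```

When every leaf is visited, the final list does not depend on the order in which the leaves arrive; the order only matters once pruning removes parts of the tree.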
2) Pruning criterion:
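The second key aspect of the STS algorithm is the following tree pruning criterion. Consider a given node $\mathbf{s}^{(i)}$ (on level $i$) and the corresponding partial label $\mathbf{x}^{(i)}$ consisting of the bits $x_{j,b}$ ($j = i, i+1, \ldots, M_T$, $b = 1, 2, \ldots, Q$). Assume that the subtree originating from the node under consideration and corresponding to the bits $x_{j,b}$ ($j = 1, 2, \ldots, i-1$, $b = 1, 2, \ldots, Q$) has not been expanded yet. The pruning criterion for $\mathbf{s}^{(i)}$ along with its subtree is compiled from two conditions. First, the bits in the partial label $\mathbf{x}^{(i)}$ are compared with the corresponding bits in the label of the current ML hypothesis $\mathbf{x}^{\mathrm{ML}}$. All metrics $\lambda_{j,b}^{\overline{\mathrm{ML}}}$ with $x_{j,b} = \overline{x_{j,b}^{\mathrm{ML}}}$ found in this comparison may be affected when searching the subtree of $\mathbf{s}^{(i)}$. Second, the metrics $\lambda_{j,b}^{\overline{\mathrm{ML}}}$ ($j = 1, 2, \ldots, i-1$, $b = 1, 2, \ldots, Q$) corresponding to the counter-hypotheses in the subtree of $\mathbf{s}^{(i)}$ may be affected as well. In summary, the metrics which may be affected during the search in the subtree emanating from the node $\mathbf{s}^{(i)}$ are given by the set
$$\mathcal{A}(\mathbf{x}^{(i)}) = \big\{\lambda_{j,b}^{\overline{\mathrm{ML}}} \,\big|\, (j \geq i, \forall b) \wedge (x_{j,b} = \overline{x_{j,b}^{\mathrm{ML}}})\big\} \cup \big\{\lambda_{j,b}^{\overline{\mathrm{ML}}} \,\big|\, j < i, \forall b\big\}.$$
The node $\mathbf{s}^{(i)}$ along with its subtree is pruned if its PED satisfies
$$d(\mathbf{s}^{(i)}) > \max_{a \in \mathcal{A}(\mathbf{x}^{(i)})} a. \tag{10}$$
This pruning criterion (illustrated in Fig. 2) ensures that a given node and the entire subtree originating from that node are explored only if this could lead to an update of either $\lambda^{\mathrm{ML}}$ or of at least one of the $\lambda_{j,b}^{\overline{\mathrm{ML}}}$.
Fig. 3. LLR clipping reduces the search radius to $r_{\max} = \sqrt{\lambda^{\mathrm{ML}} + L_{\max}}$ around the received point $\tilde{\mathbf{y}}$.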
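A possible software rendering of this pruning test is sketched below. It assumes 0-based stream indices with the last stream closest to the root, labels stored as tuples of bit tuples, and a partial label given as the list of labels of the streams already fixed; the function names are hypothetical. Adding $\lambda^{\mathrm{ML}}$ to the candidate set is a choice made here to cover the case of an empty set; it does not alter the threshold otherwise, because no counter-hypothesis metric is smaller than $\lambda^{\mathrm{ML}}$.

```python
def affected_metrics(partial_label, x_ml, lam_bar):
    """Counter-hypothesis metrics that the subtree below a node may still update.

    partial_label holds the labels of the last len(partial_label) streams,
    i.e., the streams fixed on the path from the root to the node.
    """
    m_t = len(x_ml)
    first = m_t - len(partial_label)  # 0-based index of the node's own stream
    # Fixed bits that differ from the ML label: every leaf below is a
    # counter-hypothesis for these bits.
    a1 = [lam_bar[j][b]
          for j in range(first, m_t)
          for b, bit in enumerate(partial_label[j - first])
          if bit != x_ml[j][b]]
    # Bits of streams not yet decided: both bit values occur in the subtree.
    a2 = [metric for j in range(first) for metric in lam_bar[j]]
    return a1 + a2


def should_prune(ped, partial_label, x_ml, lam_ml, lam_bar):
    """True if the node and its subtree cannot update any metric in the list."""
    candidates = affected_metrics(partial_label, x_ml, lam_bar) + [lam_ml]
    return ped > max(candidates)


# Illustrative list state for two BPSK streams.
x_ml, lam_ml, lam_bar = ((1,), (0,)), 0.65, [[4.25], [1.45]]
keep = should_prune(2.0, [(1,)], x_ml, lam_ml, lam_bar)         # threshold 4.25
drop = should_prune(1.5, [(1,), (1,)], x_ml, lam_ml, lam_bar)   # threshold 1.45
```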
IV. METHODS FOR COMPLEXITY REDUCTION
So far, we discussed strategies which solve (2) exactly and hence do not compromise performance of the max-log APP decoder. The goal of this section is to describe methods, again with a view towards VLSI implementation, that allow decoder complexity to be traded off against (error rate) performance.
The complexity measure employed throughout this paper is the number of nodes (including the leaves, but excluding the root) visited by the decoder. In Section VI, we show that this simple complexity measure provides a good indication for the complexity of a corresponding VLSI implementation.
A. LLR Clipping
The dynamic range of LLRs is typically not bounded. However, practical systems need to constrain the magnitude of the LLR values to enable fixed-point implementation. Evidently, this will lead to a performance degradation. A straightforward way of ensuring that LLR values are bounded is to clip them after the detection stage so that
$$|L(x_{j,b})| \leq L_{\max}, \quad \forall j, b. \tag{11}$$
We emphasize that the constraint in (11) refers to the normalized LLRs $L(x_{j,b})$ as defined in (2) so that $L_{\max}$ is a normalized maximum LLR value.
a) LLR Clipping for RTS: It has been noted in [12] that (11) can be built into the RTS algorithm as a constraint leading to a reduction in search complexity. The basic idea is to recognize that (5) together with (11) results in an upper bound on the radius $r_{j,b}$ (as illustrated in Fig. 3). To this end, $r_{j,b}$ is initialized as described in Section III-A followed by an immediate update according to
$$r_{j,b} \leftarrow \min\Big\{r_{j,b},\ \sqrt{\lambda^{\mathrm{ML}} + L_{\max}}\Big\} \tag{12}$$
which ensures that (11) is satisfied. Note that as a consequence of (12), metrics associated with counter-hypotheses for which no valid lattice point is found equal $\lambda^{\mathrm{ML}} + L_{\max}$.
b) LLR Clipping for STS: LLR clipping can be built into the STS algorithm by simply applying the update
$$\lambda_{j,b}^{\overline{\mathrm{ML}}} \leftarrow \min\big\{\lambda_{j,b}^{\overline{\mathrm{ML}}},\ \lambda^{\mathrm{ML}} + L_{\max}\big\}, \quad \forall j, b \tag{13}$$
after carrying out the steps in Case i) of the list administration procedure described in Section III-B. The remaining steps of the STS algorithm are not affected. Both in the RTS and the STS algorithm, for $L_{\max} = \infty$, we obviously get the exact max-log LLRs, whereas for $L_{\max} = 0$, we obtain hard-output SESD performance as the decoder output is $\mathbf{x}^{\mathrm{ML}}$, $\lambda^{\mathrm{ML}}$, and $L(x_{j,b}) = 0$ for all $j$ and $b$. Smaller values of $L_{\max}$ lead to more aggressive pruning of the tree and hence to reduced search complexity. We shall see in Section V that as we reduce $L_{\max}$, the decoder performance degrades gracefully, eventually resulting in hard-output ML performance. The parameter $L_{\max}$ can therefore be used to efficiently adjust the detection complexity/performance trade-off. We conclude by noting that LLR clipping as described above goes beyond the initial motivation of constraining the word-width used to represent LLR values in binary logic.
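The effect of clipping on the list and on the resulting LLRs can be illustrated with the short sketch below. It assumes that the clipping update bounds every counter-hypothesis metric by $\lambda^{\mathrm{ML}} + L_{\max}$, which is consistent with the statement that counter-hypotheses never found end up at exactly that value; the function names and the data layout are hypothetical.

```python
def clip_metrics(lam_ml, lam_bar, l_max):
    """Bound every counter-hypothesis metric by lam_ml + l_max."""
    return [[min(metric, lam_ml + l_max) for metric in row] for row in lam_bar]


def llrs_from_list(x_ml, lam_ml, lam_bar):
    """Normalized max-log LLRs from the list: the sign follows the ML bit."""
    return [[lam_ml - metric if bit == 0 else metric - lam_ml
             for bit, metric in zip(label, row)]
            for label, row in zip(x_ml, lam_bar)]


x_ml, lam_ml = ((1,), (0,)), 0.65
lam_bar = [[4.25], [float("inf")]]  # second counter-hypothesis never found

clipped = clip_metrics(lam_ml, lam_bar, l_max=2.0)
llrs = llrs_from_list(x_ml, lam_ml, clipped)
hard = llrs_from_list(x_ml, lam_ml, clip_metrics(lam_ml, lam_bar, l_max=0.0))
```

With `l_max=0.0` every LLR collapses to zero, so only the hard decision `x_ml` carries information, which matches the hard-output limit described above.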
B. Sorting and Regularization
Sorting: A common approach to reduce complexity in sphere decoding without compromising performance is to adapt the detection order of the spatial streams to the instantaneous channel realization by performing a QR-decomposition on HP (rather than H), where P is a suitably chosen M T × M T permutation matrix. More efficient pruning of the search tree is obtained if sorting is performed such that "stronger streams" (in terms of effective SNR) correspond to levels closer to the root, i.e., if P is chosen such that the main diagonal entries of R in HP = QR are sorted in ascending order. Solving this problem exactly would result in prohibitive complexity. A heuristic algorithm resulting in a good complexity/performance trade-off was proposed in [18] and will be referred to as sorted QR-decomposition (SQRD) in the following.
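Regularization: Poorly conditioned channel realizations $\mathbf{H}$ lead to high search complexity due to the low effective SNR on one or more of the effective spatial streams. An efficient way to counter this problem is to operate on a regularized channel matrix by computing the (sorted) QR-decomposition of
$$\begin{bmatrix} \mathbf{H} \\ \alpha \mathbf{I}_{M_T} \end{bmatrix} \mathbf{P} = \mathbf{Q}\mathbf{R} = \begin{bmatrix} \mathbf{Q}_1 \\ \mathbf{Q}_2 \end{bmatrix} \mathbf{R} \tag{14}$$
where $\alpha$ is a suitably chosen regularization parameter, $\mathbf{Q}$ is a $(M_R + M_T) \times M_T$ unitary matrix partitioned into the $M_R \times M_T$ matrix $\mathbf{Q}_1$ and the $M_T \times M_T$ matrix $\mathbf{Q}_2$, and $\mathbf{R}$ is an $M_T \times M_T$ upper-triangular matrix. The max-log LLRs are then approximated as
$$L(x_{j,b}) \approx \min_{\tilde{\mathbf{s}} \in \mathcal{X}_{j,b}^{(0)}} \|\tilde{\mathbf{y}} - \mathbf{R}\tilde{\mathbf{s}}\|^2 - \min_{\tilde{\mathbf{s}} \in \mathcal{X}_{j,b}^{(1)}} \|\tilde{\mathbf{y}} - \mathbf{R}\tilde{\mathbf{s}}\|^2 \tag{15}$$
where $\tilde{\mathbf{y}} = \mathbf{Q}_1^H \mathbf{y}$ and $\tilde{\mathbf{s}} = \mathbf{P}^T\mathbf{s}$ (equivalently, $\mathbf{s} = \mathbf{P}\tilde{\mathbf{s}}$). The LLRs in (15) need to be reordered at the end of the detection process to account for the permutation induced by $\mathbf{P}$.
Note that even though $\mathbf{Q}$ is unitary, $\mathbf{Q}_1$ will, in general, not be unitary, which is the reason for the LLRs in (15) being an approximation to the exact (max-log) LLRs in (2). The basic idea underlying this approximation is to perform the QR-decomposition on the regularized channel matrix and to apply the result to the physical channel matrix. In the following, we provide a qualitative discussion of the error incurred by this procedure. We start by noting that, as a consequence of (14), we have
$$\tilde{\mathbf{y}} = \mathbf{R}\tilde{\mathbf{s}} + \tilde{\mathbf{n}} \tag{16}$$
with the effective noise-plus-(self)-interference (NPI) vector $\tilde{\mathbf{n}} = -\alpha \mathbf{Q}_2^H \mathbf{s} + \mathbf{Q}_1^H \mathbf{n}$. Assuming $\mathrm{E}[\mathbf{s}\mathbf{s}^H] = \frac{1}{M_T}\mathbf{I}_{M_T}$, straightforward manipulations reveal that
$$\mathbf{K} = \mathrm{E}[\tilde{\mathbf{n}}\tilde{\mathbf{n}}^H] = \frac{\alpha^2}{M_T}\mathbf{Q}_2^H\mathbf{Q}_2 + N_o\,\mathbf{Q}_1^H\mathbf{Q}_1.$$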
Setting $\alpha = \pm\sqrt{N_o M_T}$ corresponds to an MMSE regularization [19], results in $\mathbf{K} = N_o \mathbf{I}_{M_T}$, and yields a good performance/complexity trade-off. We emphasize, however, that setting $\alpha = \pm\sqrt{N_o M_T}$ will not render the effective NPI vector $\tilde{\mathbf{n}}$ Gaussian. In the remainder of the paper, we denote the QR-decomposition in (14) with $\alpha = \pm\sqrt{N_o M_T}$ as MMSE-SQRD. An important practical aspect of MMSE-SQRD results from the fact that the noise variance $N_o$ has to be estimated. We found that, in general, even slight overestimation of $N_o$ will lead to a noticeable performance degradation, whereas slight underestimation does not seem to constitute a problem.
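The step from the regularized QR-decomposition to a white NPI covariance can be spelled out as follows. This is a sketch under stated assumptions: the decomposition in (14) is taken to be of the stacked matrix $[\mathbf{H}^T\ \alpha\mathbf{I}_{M_T}]^T\mathbf{P}$ with $\mathbf{Q}$ partitioned into an upper block $\mathbf{Q}_1$ and a lower block $\mathbf{Q}_2$, $\mathbf{P}$ is a permutation matrix with $\mathbf{s} = \mathbf{P}\tilde{\mathbf{s}}$, the symbols satisfy $\mathrm{E}[\mathbf{s}\mathbf{s}^H] = \frac{1}{M_T}\mathbf{I}_{M_T}$, and $\mathbf{s}$ and $\mathbf{n}$ are uncorrelated.

```latex
\begin{align*}
\mathbf{Q}_1\mathbf{R} &= \mathbf{H}\mathbf{P}, \qquad
\mathbf{Q}_2\mathbf{R} = \alpha\mathbf{P}, \qquad
\mathbf{Q}_1^H\mathbf{Q}_1 + \mathbf{Q}_2^H\mathbf{Q}_2 = \mathbf{I}_{M_T} \\
\tilde{\mathbf{y}} = \mathbf{Q}_1^H\mathbf{y}
  &= \mathbf{Q}_1^H\mathbf{H}\mathbf{P}\tilde{\mathbf{s}} + \mathbf{Q}_1^H\mathbf{n}
   = \mathbf{Q}_1^H\mathbf{Q}_1\mathbf{R}\tilde{\mathbf{s}} + \mathbf{Q}_1^H\mathbf{n} \\
  &= \big(\mathbf{I}_{M_T} - \mathbf{Q}_2^H\mathbf{Q}_2\big)\mathbf{R}\tilde{\mathbf{s}} + \mathbf{Q}_1^H\mathbf{n}
   = \mathbf{R}\tilde{\mathbf{s}} - \alpha\mathbf{Q}_2^H\mathbf{P}\tilde{\mathbf{s}} + \mathbf{Q}_1^H\mathbf{n} \\
\tilde{\mathbf{n}} &= -\alpha\mathbf{Q}_2^H\mathbf{s} + \mathbf{Q}_1^H\mathbf{n} \\
\mathbf{K} = \mathrm{E}\big[\tilde{\mathbf{n}}\tilde{\mathbf{n}}^H\big]
  &= \frac{\alpha^2}{M_T}\mathbf{Q}_2^H\mathbf{Q}_2 + N_o\,\mathbf{Q}_1^H\mathbf{Q}_1 \\
\alpha^2 = N_o M_T \;&\Longrightarrow\;
\mathbf{K} = N_o\big(\mathbf{Q}_2^H\mathbf{Q}_2 + \mathbf{Q}_1^H\mathbf{Q}_1\big) = N_o\,\mathbf{I}_{M_T}
\end{align*}
```

The self-interference term $-\alpha\mathbf{Q}_2^H\mathbf{s}$ is a linear function of the discrete-valued symbols, which is why the NPI vector is white but not Gaussian for this choice of $\alpha$.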
C. Run-Time Constraints
The computational complexity (required to find the ML solution and the LLR values) of the algorithms discussed so far depends on the realization of the random channel matrix as well as on the noise realization. Consequently, the decoder throughput is variable, which constitutes a problem in many practical application scenarios. Moreover, the worst-case complexity corresponds to an exhaustive search. In order to meet the practically important requirement of a fixed throughput, the algorithm run-time must be constrained. This, in turn, leads to a constraint on the maximum detection effort or, equivalently, a constraint on the maximum number of nodes the sphere decoder is allowed to visit. Clearly, this will, in general, prevent the detector from achieving ML or max-log APP performance. It is therefore important to find a way of imposing run-time constraints while keeping the resulting performance degradation at a minimum. Moreover, in practice, it is highly desirable to have a smooth performance degradation as the run-time constraint becomes more stringent.
In the following, we restrict ourselves to the STS algorithm. A straightforward way of enforcing a run-time constraint is to terminate the search, on a symbol vector by symbol vector basis, after a maximum number of visited nodes. The STS decoder then returns the best solution found so far, i.e., the current ML and counter-hypotheses. A better solution is to impose an aggregate run-time constraint of $N D_{\mathrm{avg}}$ visited nodes for an entire block of $N$ symbol vectors¹ [20]. The maximum number of visited nodes allocated to the detection of the $k$th symbol vector can, for example, be chosen according to the maximum-first (MF) scheduling strategy as
$$D_{\max}(k) = N D_{\mathrm{avg}} - \sum_{i=1}^{k-1} D(i) - (N - k) M_T \tag{17}$$
for $k = 1, 2, \ldots, N$, where $D(i)$ denotes the actual number of visited nodes for the $i$th symbol vector. The concept behind (17) is that decoding a given symbol vector is allowed to consume all of the remaining run-time within the block of $N$ symbol vectors up to a safety margin of $(N - k)M_T$ visited nodes. This margin makes it possible to find at least the hard-output SIC solution for all remaining symbol vectors. Setting $D_{\mathrm{avg}} = M_T$ maximizes the throughput, but reduces the performance to that of hard-output SIC. We emphasize that, under run-time constraints, there may be LLRs at the end of the decoding process that have not been updated from their initial value of $\infty$ and hence need to be set to $L_{\max}$.
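The maximum-first budget can be illustrated with a small scheduling sketch. It assumes that the budget for the $k$th vector is the block total $N D_{\mathrm{avg}}$ minus the nodes already spent minus the safety margin $(N-k)M_T$, as described above; the input `node_demand` (the number of nodes each vector would visit if unconstrained) and the function name are hypothetical, and every demand is assumed to be at least $M_T$.

```python
def mf_schedule(node_demand, d_avg, m_t):
    """Nodes actually spent per symbol vector under maximum-first scheduling."""
    n = len(node_demand)
    spent, used = [], 0
    for k, demand in enumerate(node_demand, start=1):
        # Everything left in the block, minus M_T nodes reserved for each
        # of the N - k vectors still to come.
        d_max = n * d_avg - used - (n - k) * m_t
        d_k = min(demand, d_max)
        spent.append(d_k)
        used += d_k
    return spent


# Block of N = 4 vectors, M_T = 4 streams, average budget of 6 nodes per vector.
spent = mf_schedule([4, 30, 5, 4], d_avg=6, m_t=4)   # [4, 12, 4, 4]
tight = mf_schedule([4, 30, 5, 4], d_avg=4, m_t=4)   # [4, 4, 4, 4]
```

In the first call the second vector absorbs the slack left by the first one, while the last two vectors still receive the $M_T$ nodes needed to reach a leaf; in the second call every vector is limited to $M_T$ nodes.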
V. PERFORMANCE/COMPLEXITY TRADE-OFFS
System engineers typically face the problem of designing a receiver that achieves a prescribed target frame error rate (FER) at a prescribed throughput. The quality of the receiver implementation can then be measured by the minimum SNR required to achieve this target FER at the specified throughput. In the following, we assess the complexity/performance trade-offs of the receiver concepts described in Sections III and IV by plotting the average (over independent channel and noise realizations) number of visited nodes as a function of this minimum SNR. Since the number of visited nodes is related to the required chip area per throughput [16], the corresponding results allow a reduction in hardware complexity (e.g., chip area) to be associated with an SNR penalty.
¹ In an OFDM-based MIMO system, N would, for example, be the number of OFDM tones.
All simulation results below are for a rate $R = 1/2$ (generator polynomials $[133_o\ 171_o]$ and constraint length 7) convolutionally encoded MIMO-OFDM system [21] with $M_T = M_R = 4$, 16-QAM constellation (using Gray mapping), $N = 64$ tones, and soft-input Viterbi decoding [22]. Note that for $L_{\max} = 0$, one has to employ a hard-input Viterbi decoder. One frame consists of 1024 randomly interleaved (across space and frequency) bits and corresponds to one OFDM symbol. A TGn type C channel model [23] is used and in all simulations, SNR is the per receive antenna SNR.
A. Comparison of Tree-Search Strategies
Fig. 4 compares the performance of RTS and STS max-log APP decoders, and the list sphere decoder (LSD) [10] for different target FERs and different values of $L_{\max}$. In the case of the LSD, changing the list size allows the complexity/performance trade-off to be adjusted. The STS algorithm is seen to outperform the RTS strategy in terms of average complexity by a factor of 4 to 8.
The implementation of the LSD requires memory and logic for the administration of the candidate list, not accounted for in this comparison. Fig. 4 shows that even when this additional complexity is ignored, the STS is still superior to the LSD. Under stringent complexity constraints the STS shows SNR advantages over the LSD of up to 1.5 dB.
B. Impact of Sorting and Regularization
We show the trade-off curves corresponding to SQRD, MMSE-SQRD, and standard (unsorted) QRD at a target FER of 0.01. It can be seen that the improvement resulting from sorting (i.e., SQRD vs. QRD) becomes significant for stringent (but realistic) complexity constraints. Further improvements, in the low-complexity regime, are obtained from regularization using MMSE-SQRD. In the high-complexity regime, the performance penalty incurred by regularization (see the discussion in Section IV-B) eventually renders MMSE-SQRD inferior to SQRD.
C. LLR Clipping
Both Fig. 4 and Fig. 5 show that, as discussed in Section IV-A, adjusting the LLR clipping level $L_{\max}$ makes it possible to sweep an entire family of sphere decoders ranging from the exact max-log APP SESD ($L_{\max} = \infty$) to hard-output SESD ($L_{\max} = 0$). It is interesting to observe that aggressive clipping according to $L_{\max} = 0.2$ yields close to max-log APP performance. Increasing the LLR clipping level beyond this value increases complexity without a noticeable performance improvement. Furthermore, we observe that the decoder performance degrades gracefully as we decrease $L_{\max}$, thereby reducing the average search complexity. In summary, the LLR clipping level can be used to conveniently adjust the decoder at run-time to a given complexity constraint.
D. Run-time Constraints
In Fig. 6, we demonstrate the impact of imposing a run-time constraint according to a maximum of $N D_{\mathrm{avg}}$ visited nodes for a frame of $N = 64$ symbol vectors using the strategy described in Section IV-C. The resulting curves essentially consist of two regions:
• If the LLR clipping level $L_{\max}$ is high (corresponding to high average search complexity), the run-time constrained detector is not able to compute accurate LLR values, which results in (very) poor performance, unless $D_{\mathrm{avg}}$ is large. For $D_{\mathrm{avg}} = 128$, the performance is, indeed, very close to that of the unconstrained max-log APP SESD.
• For small $L_{\max}$, the performance is dominated by the impact of clipping rather than by the impact of the run-time constraint.
In summary, we can conclude that for a given run-time constraint there exists an optimum LLR clipping level, in the sense of minimizing the SNR required to achieve a certain target FER. It is therefore important to choose the LLR clipping level in accordance with the average run-time constraint.
VI. VLSI ARCHITECTURE FOR MAX-LOG STS SOFT-OUTPUT SPHERE DECODING
Since the proposed max-log STS SESD VLSI implementation is based on the one-node-per-cycle (ONPC) VLSI architecture developed in [16] for hard-output SESD, we start our discussion by briefly reviewing relevant aspects of [16] .
A. A Brief Review of the ONPC Architecture in [16]
The VLSI architecture proposed in [16] employs two functional units:
Metric Computation Unit (MCU): The MCU handles the forward iteration in the search tree by identifying the starting point for the SE enumeration (i.e., the current node's child that has the smallest PED) using the direct-QAM enumeration algorithm initially proposed in [10] and slightly modified in [16]. The basic idea behind this enumeration method for QAM constellations is as follows: The QAM constellation is first decomposed into subsets of constellation points that have the same modulus, referred to as phase-shift keying (PSK) subsets. Within each of these PSK subsets, the child associated with the smallest PED can be identified based on the phase of $b_i = \tilde{y}_i - \sum_{j=i+1}^{M_T} R_{i,j}\, s_j$ only. The corresponding minimum PEDs (one for each subset) are then computed and compared. The minimum PED across subsets identifies the starting point for the SE enumeration. If the resulting child neither corresponds to a leaf nor qualifies for pruning, the decoder proceeds in forward direction by declaring this child as the next parent node to be examined by the MCU (cf. ① in Fig. 7).
Metric Enumeration Unit (MEU):
The MEU maintains a list of preferred children, one for each node between the root and the parent of the node whose children are currently under examination by the MCU. To this end, the MEU follows the MCU on its path through the tree with one cycle delay. While the MCU visits a node, the MEU considers this node's siblings and identifies the one that should be visited next according to the SE criterion. This sibling is found by applying the direct-QAM enumeration principle described above, where within each PSK subset the next (according to the SE criterion) candidate follows immediately by zig-zag enumeration along the circle. The decision on the preferred child across subsets must again be made by explicit computation and comparison of the smallest PEDs of the individual PSK subsets.
When the forward iteration stalls, either because the child identified by the MCU corresponds to a leaf or must be pruned, the MEU provides a new parent node to the MCU in the next clock cycle (cf. ② in Fig. 7 ). This parent node is chosen by the MEU, following the depth-first paradigm, from those members of the list of preferred children which do not qualify for pruning.
B. VLSI Architecture for STS SESD
The block diagram of the proposed max-log STS SESD VLSI implementation is shown in Fig. 7. Compared to the architecture for hard-output SESD described in [16], changes are made in the MCU and two additional units are required, one for list administration as described in Section III-B1 and one for the implementation of the pruning criterion as described in Section III-B2. We shall next describe the specifics of these changes.
Fig. 7. Block diagram of the proposed VLSI architecture for the soft-output STS SESD. Additional units, compared to the hard-output SESD described in [16], are highlighted.
1) Architectural Changes in the MCU:
From a high-level architectural perspective, there is one fundamental difference between tree-traversal for hard-output SESD and for the STS algorithm: When the node currently examined by the MCU is on the level just above the leaves (i.e., on level $i = 2$), the hard-output SESD considers only one child, namely the one associated with the smallest PED. The STS algorithm, however, has to compute the PEDs of all children that do not qualify for pruning according to the criterion (10) since these children may lead to updates of the metrics $\lambda_{j,b}^{\overline{\mathrm{ML}}}$. To perform this leaf enumeration procedure, the STS decoder must revisit the current node at level $i = 2$, which requires additional clock cycles and a leaf enumeration unit shown in Fig. 7. This unit does not, however, require an additional arithmetic unit for the PED computation as it can reuse the PED computation unit in the MCU (cf. ③ in Fig. 7).
2) List Administration and Tree Pruning: In addition to the modifications in the MCU described above, the STS algorithm requires the following two additional units:
List-Administration Unit (LAU): The LAU is responsible for maintaining and updating the list containing x^ML, λ^ML, and the λ^ML_{j,b}. The corresponding unit is active during the leaf-enumeration process described above. Since the update rules implemented by the LAU (see Section III-B1) require only a small number of logic operations, the silicon area of this unit is small (see Table II) and is dominated by the storage space required for the metrics λ^ML and λ^ML_{j,b}.
Pruning Criterion Unit (PCU): The PCU is responsible for computing the reference metrics, i.e., the RHS of (10), required to implement the corresponding pruning criterion. From a VLSI implementation perspective, the reference metric on level i depending on the partial label x^(i) constitutes a major problem. More specifically, this dependence causes the criterion for pruning the child of a parent node on level i + 1 to depend on the partial label x^(i) of that child. This, in turn, implies that enumeration of the children on level i in ascending order of their PEDs according to the SE criterion cannot be applied, which results in the need for exhaustive-search enumeration and is thus ill-suited for VLSI implementation [16]. An adjustment of the pruning criterion in (10) solves this problem. To this end, we define
and prune the node s (i) along with its subtree if d s
Note that the RHS of the modified pruning criterion (18) depends on the partial label x^(i+1) rather than on x^(i). Consequently, the enumeration of the children of a node on level i + 1 can be carried out using the SE criterion.
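The difference between a child-dependent and a parent-dependent reference metric can be sketched as follows. Since the expressions (10) and (18) are not reproduced in this section, the two functions below are only one plausible reading and should be treated as assumptions: the reference metric is taken to be the largest counter-hypothesis metric that a leaf below the node could still improve. The data layout (labels as dictionaries from level to bit tuples, metrics keyed by (level, bit)) and the function names are hypothetical.

```python
def child_reference(child_label, x_ml, lam_bar, i):
    """Reference metric that depends on the child's own label (levels >= i)."""
    cands = [lam for (j, b), lam in lam_bar.items()
             if j < i or child_label[j][b] != x_ml[j][b]]
    return max(cands, default=float("-inf"))

def parent_reference(parent_label, x_ml, lam_bar, i):
    """Reference metric that depends only on the parent's label (levels >= i+1).

    All bits on level i are treated as undecided, so the same value applies
    to every child of the parent and SE enumeration remains possible.
    """
    cands = [lam for (j, b), lam in lam_bar.items()
             if j <= i or parent_label[j][b] != x_ml[j][b]]
    return max(cands, default=float("-inf"))

def prune(ped, reference):
    """A node is pruned if its PED exceeds the reference metric."""
    return ped > reference
```

Under this reading, the parent-dependent reference is never smaller than the child-dependent one, because it maximizes over a superset of metrics. Fewer nodes are therefore pruned, but all children of a parent share one threshold.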
C. Impact of List Administration and Tree Pruning on Complexity
We argued throughout the paper that, for an ONPC architecture, the number of visited nodes is equal to the number of clock cycles required for decoding, thus reflecting the true silicon complexity of the algorithm. However, for the proposed STS architecture, the number of clock cycles will be larger than the number of visited nodes shown in the numerical results in Section V, for two reasons: First, modifying the pruning criterion (10) to result in (18) leads to less efficient pruning as
The corresponding complexity increase is, however, significantly smaller than what would be incurred if exhaustive-search enumeration based on (10) were applied. The second reason for the number of clock cycles being higher than the number of visited nodes is that every time the leaf-enumeration process is performed, one additional cycle is consumed to detect the end of the enumeration process. Consequently, the proposed VLSI architecture no longer strictly follows the ONPC paradigm. The results in Fig. 8 show, however, that the two effects discussed above increase the number of clock cycles only slightly beyond the number of visited nodes.
D. Architectural Considerations for RTS
In this section, we discuss architectural considerations for possible implementations of the RTS strategy. As described in Section III-A, the RTS approach corresponds to repeated runs of a hard-output SESD which, in principle, can be implemented efficiently using the ONPC VLSI architecture introduced in [16]. However, forcing the decoder to search only over the set X
when computing the counter-hypotheses λ^ML_{j,b} requires constraining the search to a subset of admissible constellation points, which, moreover, depends on the (bits-to-symbol) mapping. Consequently, as depicted in Fig. 9, straightforward zig-zag enumeration can no longer be applied. In addition, as demonstrated in Fig. 9, different counter-hypotheses will result in different sets of allowed constellation points, which induces an irregularity that results in an increase in hardware complexity. The problem can be mitigated to a certain extent by adjusting the mapping. However, this, in general, results in a (bit error rate) performance degradation. Alternatively, using exhaustive-search enumeration, as described in [16], to compute λ^ML and the counter-hypotheses λ^ML_{j,b} is not a viable option, as it results in significant overhead in terms of chip area and in an increase in the length of the critical path. For a quantitative analysis of the impact of exhaustive-search enumeration (in hard-output SESD), the interested reader is referred to [16].
a To provide technology-independent area characterization, the number of gate equivalents (GEs) is specified. One GE corresponds to the area of a two-input drive-one NAND gate.
b The results on clock frequency are extracted from a post-layout static timing analysis and are representative for the manufactured ASIC within an accuracy of a few percent.
c The area-timing (AT) product of a VLSI circuit is a measure for its true silicon complexity. It is given by the chip area divided by the maximum allowed clock frequency.
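The dependence of the admissible constellation points on the bit mapping, discussed for the RTS strategy in Section VI-D, can be illustrated with a small sketch. The Gray mapping used below (two bits per real dimension of 16-QAM) is an assumption chosen for illustration and is not necessarily the mapping used in the paper; the function names are hypothetical.

```python
from itertools import product

# Assumed Gray mapping of two bits to one 4-PAM coordinate.
GRAY_4PAM = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}

def qam16_map(bits):
    """Map a 4-bit label to a 16-QAM point (first two bits: real part)."""
    return complex(GRAY_4PAM[bits[0:2]], GRAY_4PAM[bits[2:4]])

def admissible_points(bit_index, bit_value):
    """Constellation points whose label has the given bit fixed to bit_value."""
    return [qam16_map(bits) for bits in product((0, 1), repeat=4)
            if bits[bit_index] == bit_value]

def real_parts(points):
    """Distinct real coordinates occupied by a set of points."""
    return sorted({int(p.real) for p in points})
```

With this mapping, fixing bit 0 to 1 leaves the real coordinates {1, 3}, whereas fixing bit 1 to 0 leaves {-3, 3}. The second set is not contiguous, so an enumeration that steps to the nearest neighbor in alternating directions would have to skip points, and the pattern of skipped points changes with the constrained bit.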
VII. ASIC IMPLEMENTATION RESULTS
In order to assess the true silicon complexity (chip area and achievable clock frequency) of the proposed STS-based soft-output SESD, we implemented the VLSI architecture described in the previous section in 0.25 µm CMOS technology for a MIMO system with M_T = M_R = 4 using 16-QAM modulation. The resulting chip layout is shown in Fig. 10. The design parameters of the decoder are summarized in Table I, which, for reference, also contains the design parameters of an ℓ2-norm hard-output SESD, following the design principles employed for the ℓ∞-norm hard-output SESD described in [16].
Hardware Complexity: We can see from Table I that the chip area required by the soft-output STS SESD is only 58% higher than that required by a corresponding ℓ2-norm hard-output SESD. The detailed area breakdown in Table II shows that most of the area increase results from the LAU, the PCU, and the arithmetic unit that computes the LLRs. Further area increase is due to the need to store the LLRs in the output buffer of the ASIC. The additional Schnorr-Euchner enumeration unit in the MCU required for leaf enumeration adds only 1.9 kGEs to the overall area. The soft-output STS SESD ASIC achieves only a slightly lower maximum clock frequency than the corresponding hard-output SESD. The reason for this negligible reduction is that most of the additional logic required by the STS SESD ASIC can be kept off the critical path and thus has little influence on the maximum clock frequency.
Detection Throughput: Fig. 11 shows the complexity/performance trade-off of the reference ℓ2-norm hard-output SESD and the soft-output STS described in Section VI in terms of the throughput
measured in information-bits per second as a function of the minimum required SNR to achieve a FER of 0.01. Here, f clk is the maximum clock frequency of the circuit under consideration and E[C] denotes the average (over channel and noise realizations) number of clock cycles required to detect a symbol vector. Note that the dedicated hard-output SESD implementation achieves a slightly higher throughput than the STS SESD implementation with 2 L max = 0. This is due to the slightly higher maximum clock frequency of the corresponding hard-output SESD (see Table I ).
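As a rough numerical illustration of the throughput measure described above, the following sketch assumes that each detected symbol vector carries M_T · Q coded bits, of which a fraction R (the code rate) are information bits, so that the throughput is R · M_T · Q · f_clk / E[C]. This expression, the parameter names, and the numbers in the usage note are assumptions made for illustration; they are not taken from the paper's results.

```python
def throughput_bps(f_clk_hz, avg_cycles, num_streams, bits_per_symbol, code_rate):
    """Information bits per second for a detector needing avg_cycles per vector.

    f_clk_hz        : maximum clock frequency f_clk in Hz
    avg_cycles      : average number of clock cycles E[C] per symbol vector
    num_streams     : number of spatial streams M_T (assumed parameter)
    bits_per_symbol : bits per constellation symbol Q (assumed parameter)
    code_rate       : channel code rate R (assumed parameter)
    """
    if avg_cycles <= 0:
        raise ValueError("avg_cycles must be positive")
    info_bits_per_vector = code_rate * num_streams * bits_per_symbol
    vectors_per_second = f_clk_hz / avg_cycles
    return info_bits_per_vector * vectors_per_second
```

For example, with hypothetical values f_clk = 100 MHz, E[C] = 20, M_T = 4, Q = 4, and R = 1/2, the function returns 40 Mb/s. The expression also shows why a higher clock frequency or a lower average cycle count translates directly into higher throughput.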
VIII. CONCLUSIONS
Sphere decoding is a suitable tool to implement MIMO detection with variable complexity/performance trade-off. In particular, adjusting the LLR clipping level and imposing maximum run-time constraints are an efficient way of realizing an entire family of decoders with (error rate) performance ranging from exact max-log soft-output to hard-output SIC. The keys to achieving low hardware complexity are the single tree-search strategy described in Section III-B, MMSE-SQRD preprocessing, LLR clipping, and run-time constraints with maximum-first scheduling. Our VLSI implementation results indicate that the silicon area required by a soft-output STS SESD is only about 58% higher than the area required for a corresponding ℓ2-norm hard-output SESD implementation. This paves the way for a VLSI implementation of iterative MIMO detection based on sphere decoding.
