Abstract: Multiple-input multiple-output (MIMO) wireless transmission poses major challenges for the design of efficient hardware architectures for iterative receivers. A major challenge is soft-input soft-output (SISO) MIMO demapping, often approached by sphere decoding (SD). In this paper, we introduce what is, to the best of our knowledge, the first VLSI architecture for SISO SD applying a single tree-search approach. Compared with a soft-output-only base architecture similar to the one proposed by Studer et al. in IEEE J-SAC 2008, the architectural modifications for soft input still allow a one-node-per-cycle execution. For a 4×4 16-QAM system, the area increases by 57 % and the operating frequency degrades by only 34 %.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) wireless transmissions utilizing spatial multiplexing achieve an increased spectral efficiency compared with single-antenna systems. This improvement comes at the cost of an increased signal-demapping complexity, which becomes particularly critical for iterative receivers [1]. Recent developments of soft-input soft-output (SISO) MIMO-demapping algorithms have reduced this complexity significantly. Prominent demapping algorithms are k-best and list-based approaches [2], [3], Markov chain Monte Carlo (MCMC) algorithms [4] and single tree-search (STS) sphere decoders (SD) [5]. The STS approach is often preferred since it guarantees max-log maximum a posteriori (MAP) optimality.
Efficient VLSI implementations have been proposed for soft-output-only STS SDs [6], [7] exploiting geometric properties of QAM constellations. These geometric relations help to determine a search order, referred to as enumeration, that leads to a fast average tree-search convergence. The SISO STS complexity has been prohibitive for VLSI implementations so far, because the geometric relations are not directly applicable. Recent improvements of soft-input enumeration strategies have moved SISO STS SD closer to VLSI architectures [8].
Contributions: In this paper, we introduce what is, to the best of our knowledge, the first VLSI architecture for SISO STS SD. It is based on a soft-output-only architecture following the one-node-per-cycle (ONPC) paradigm used by [6]. The SISO modifications are modular enough to be applied to other existing STS SD architectures and still allow ONPC execution. Compared with a soft-output-only architecture, the area increases by 57 % and the clock frequency degrades by 34 % for a 4 × 4 16-QAM system. Thus, this architecture enables STS-based iterative wireless MIMO receivers.
The paper is organized as follows: Section II sums up the basics of SISO STS SD, extended by the soft-input enumeration strategy in Section III. Section IV describes important implementation aspects of the scalable VLSI architecture. In Section V the parameter design space of the SISO STS architecture as well as area, timing and throughput are discussed.
II. SINGLE TREE-SEARCH SOFT-INPUT SPHERE DECODING
A spatial-multiplexing MIMO scheme with $M_T$ transmit and $M_R \ge M_T$ receive antennas is assumed [1]. Each transmit antenna sends one of the $2^Q$ complex elements of the symbol set $\mathcal{O}$ defined by the modulation alphabet, which is assumed to be the same for every antenna. Each vector symbol $s \in \mathcal{O}^{M_T}$ results from mapping $M_T Q$ bits $x_{i,b} \in \{+1, -1\}$ to an element of $\mathcal{O}^{M_T}$, with $i$ being the antenna index and $b$ the bit index for one scalar symbol $s_i$.
The received symbol vector $y \in \mathbb{C}^{M_R}$ is given by $y = Hs + n$, where $H \in \mathbb{C}^{M_R \times M_T}$ is the channel matrix and $n \in \mathbb{C}^{M_R}$ is a white circular Gaussian noise vector with variance $N_0$ per element. For tree-search SD, $H$ is typically QR-decomposed (QRD) as $H = QR$, with $Q \in \mathbb{C}^{M_R \times M_T}$, $Q^H Q = I$, and $R \in \mathbb{C}^{M_T \times M_T}$ being an upper triangular matrix [1], [5]. With $\tilde{y} = Q^H y$ and $\tilde{n} = Q^H n$, this results in

$$\tilde{y} = Rs + \tilde{n}. \qquad (1)$$
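The step from the channel model to the triangular system can be written out explicitly. The following short derivation uses only the definitions given above ($H = QR$, $Q^H Q = I$):

```latex
\begin{aligned}
\tilde{y} &= Q^H y = Q^H (H s + n) = Q^H (Q R s + n) \\
          &= (Q^H Q) R s + Q^H n = R s + \tilde{n}.
\end{aligned}
```

Because the columns of $Q$ are orthonormal, $\tilde{n}$ is again white Gaussian noise with variance $N_0$ per element, so no noise coloring is introduced by this transformation.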
According to [5], the triangular matrix $R$ in equation (1) allows the SISO max-log MAP MIMO detection problem to be formulated as an STS within a complete $2^Q$-ary tree. The tree levels correspond to the $M_T$ antennas; each node $s_i \in \mathcal{O}$ on tree level $i$ is a received symbol candidate, with $s_1$ being a leaf node. An exhaustive search in such a tree leads to a worst-case run-time complexity of $O(2^{Q M_T})$. As formalized in equations (2) to (4), metric increments $M_C(s_i)$ for channel-based and $M_A(s_i)$ for a priori-based information are summed up to a total increment $M_P(s_i)$. $P[s_i]$ is the symbol probability computed from the a priori log-likelihood ratios (LLRs)
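To give a sense of scale for the worst-case complexity, the number of leaves can be evaluated for the 4 × 4 16-QAM system considered in this paper ($M_T = 4$, $Q = 4$):

```latex
2^{Q M_T} = 2^{4 \cdot 4} = 2^{16} = 65\,536 \ \text{candidate symbol vectors per received vector.}
```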
The sum of metric increments along a path from the root to node $s_i$ yields the partial metric $M_P(s^{(i)})$ for a partial symbol vector
These metric computations dominate the detection complexity. For a depth-first tree search, the pruning of sub-trees lying outside a hypersphere with a radius not improving λ
provides a heuristic for complexity reduction which is sensitive to the visiting order [s
A Schnorr-Euchner (SE) order [9] provides a very fast search convergence by the following pruning criteria [6] , typically defining the pruning metrics
, ∀b
If inequality (6) holds, the current node and its sub-tree are pruned; otherwise, a step down is performed in the tree. If inequality (7) holds, the enumeration on level $j$ stops; otherwise, the sibling of the current node is enumerated. The arguments of the max operators in (6) and (7) are the sets A and B, respectively, in [6]. We define an examined node (as used in [6] and [7]) as a node $s_j$ that has been checked against at least one pruning criterion. This leads to the complexity measure $N_{\mathrm{en}}$, the number of examined nodes per detected symbol vector. If a leaf node with $M_P(s) \ge \lambda^{\mathrm{MAP,cur}}$ is not pruned by inequalities (6) or (7), the values {Λ
MAP,cur , the current leaf becomes the new MAP solution and the extrinsic counter-hypothesis metrics {Λ
}. Many methods exist to reduce N en , like sorted QRD (SQRD) [10] and extrinsic LLR clipping [5] . The latter one limits the allowed range for 
Please note that equation (8) is stricter than the min{} function used in [5] where a post-processing step is used to guarantee |L
saves 50 % of the comparisons required for clipping. Experiments indicate that E[N en ] differs only marginally between the two clipping methods. Moreover, radius tightening further reduces N en . A hardware-friendly approximation of M A (s i ) for statistically independent symbols, including tightening and still guaranteeing max-log-optimal a posteriori LLRs, has been proposed in [5] (with unipolar bits
III. THE HYBRID-ENUMERATION ALGORITHM
A major issue of SD algorithms is the enumeration process, namely the determination of the SE order $[s_i^{(1)}, s_i^{(2)}, \ldots, s_i^{(2^Q)}]$, with $s_i^{(k)}$ representing the $k$-th candidate for node $s_i$, in ascending order of $M_P$. A straightforward implementation that computes and fully sorts the set $\{M_P(s_i^{(k)})\}$ is very expensive and inefficient. For the soft-output-only case, the geometric properties of the QAM constellation can be exploited to avoid full sorting and thus save most of the computations, as proposed in [6], [7], [11]. However, in iterative receivers these optimizations are not directly usable because the geometry-based order is scrambled by the a priori information. A viable approach towards efficient soft-input enumeration is given by the hybrid-enumeration algorithm presented in [8]. Its basic idea is to split the enumeration of $\{M_P(s_i^{(k)})\}$ into two separate enumerations of $\{M_C(s_i^{(k)})\}$ and $\{M_A(s_i^{(k)})\}$.
On the one hand, the enumeration of $\{M_C(s_i^{(k)})\}$ is the same as in the soft-output-only case, thus allowing any of the aforementioned efficient methods to be reused, even in later iterations. On the other hand, the enumeration of $\{M_A(s_i^{(k)})\}$ is efficient as well, since the linear sorting of the symbol set $\mathcal{O}$ needs to be performed independently only once per antenna.
According to [8], the channel- and a priori-based enumerations independently select candidate symbols $s$ at each step $k$. Of these two candidates, the hybrid enumeration simply selects the one with the lower metric $M_P$.
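The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of [8]: the function name, the list-based inputs `m_c` and `m_a` (one metric increment per symbol of the constellation), and the tie-breaking in favor of the channel-based candidate are assumptions made for this example.

```python
def hybrid_enumeration(m_c, m_a):
    """Return the order in which symbol indices are enumerated.

    m_c[k], m_a[k]: channel-based and a priori-based metric increments
    of symbol k (hypothetical inputs chosen for illustration).
    """
    n = len(m_c)
    order_c = sorted(range(n), key=lambda k: m_c[k])  # channel-based order
    order_a = sorted(range(n), key=lambda k: m_a[k])  # a priori-based order
    enumerated = [False] * n                          # enumerated-nodes flags
    pc = pa = 0                                       # read pointers
    result = []
    while len(result) < n:
        # Each enumeration skips symbols that were already enumerated.
        while enumerated[order_c[pc]]:
            pc += 1
        while enumerated[order_a[pa]]:
            pa += 1
        cand_c, cand_a = order_c[pc], order_a[pa]
        # Select the candidate with the lower total metric M_P = M_C + M_A
        # (ties go to the channel-based candidate: an assumption).
        m_p_c = m_c[cand_c] + m_a[cand_c]
        m_p_a = m_c[cand_a] + m_a[cand_a]
        chosen = cand_c if m_p_c <= m_p_a else cand_a
        enumerated[chosen] = True
        result.append(chosen)
    return result


order = hybrid_enumeration([0.1, 0.5, 0.9, 1.4], [2.0, 0.0, 0.3, 0.0])
```

For these example values the total metrics are 2.1, 0.5, 1.2 and 1.4, and the resulting order is 1, 3, 2, 0. Symbol 3 (metric 1.4) is enumerated before symbol 2 (metric 1.2), which illustrates that the hybrid order is not necessarily ascending in $M_P$.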
As visualized in Figure 1, the strict SE order is not preserved, hence the inequality $M_P(s_i^{(k)}) \le M_P(s_i^{(l)})$, $\forall l > k$, does not hold any more. Thus, a modification of the pruning criteria is needed to avoid the erroneous exclusion of the MAP or counter-hypothesis solutions. For $l > k$, the inequalities
A,i ) lead to
(Draft of April 7, 2010. Accepted for IEEE Transactions on Circuits and Systems II: Express Briefs, May 2010. This draft will not be updated any more. Please refer to IEEE Xplore for the final version. The final publication will appear with the modified title "A Scalable VLSI Architecture for Soft-Input Soft-Output Single Tree-Search Sphere Decoding".)
, providing an alternative lower bound for tree pruning. Thus, in [8] the pruning metric of inequality (7) on the current tree level i is re-defined as
Compared with the SE order, pruning metric (10) preserves the error-rate performance at the price of a slight increase in N en . For a more detailed description and analysis of the hybrid-enumeration algorithm, the reader is referred to [8] .
IV. A VLSI ARCHITECTURE FOR STS SOFT-INPUT SPHERE DECODING
In this section, a VLSI architecture for SISO STS SD is introduced. It is derived from a soft-output-only depth-first STS base architecture extended by soft-input processing. The main challenges arising from the implementation of efficient soft-input extensions according to the hybrid-enumeration scheme are discussed. Further algorithmic optimizations such as the LLR correction proposed in [5] are orthogonal to the base architecture and can be implemented on top of it.
A. Soft-Output-Only Base Architecture
The soft-output-only base STS architecture, composed of the light gray blocks in Figure 2 , follows the ONPC execution principle used by Studer et al. in [6] . Its architectural structure is derived from the observation that the tree search is composed of three basic control-flow steps:
i) Vertical steps (①) down from tree level $i$ to $i-1$ enumerate the first child node $s_i$. This requires a quantization step $\mathcal{Q}$ to find the QAM symbol closest to $\tilde{y}_i$, followed by the computation of $M_P(s_{i-1}^{(1)})$. The result of $\mathcal{Q}$ is used to initialize the enumeration on the tree level $i-1$ and by the pruning-criteria check for s
has to be performed next. The $M_P$ history (④) unit stores the partial metrics $M_P(s^{(i)})$, recursively implements equation (5) and provides its result to unit ③ for pruning and LLR clipping by equation (8).
In a depth-first SD, the tree-traversal control flow exhibits severe data and control dependencies. In order to achieve a throughput of one examined node per cycle, the base architecture executes the pruning check for node s 
is saved in a preferred-siblings cache (⑤) for later use during a step up in the tree. Thus, in cycle $n+1$ the availability of a valid node for the next pruning check is guaranteed.
The enumeration unit of the base architecture employs the column-wise zig-zag enumeration strategy (⑥) presented in [11]. Compared with circular PSK-like enumeration [6], the column-wise enumeration allows a much more regular hardware implementation. Furthermore, for 64-QAM and higher modulation orders it requires fewer comparisons. Since there is no assumption on the mapping between QAM symbols and bits, two run-time-programmable lookup tables, named mapper M and demapper D respectively, are used for the conversion between the symbol and the bit representations.
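[Figure 2: block diagram of the architecture. Labels recoverable from the figure: STS Control FSM; inputs $R$, $\tilde{y}$, $\{L^A_{i,b}\}$; $\tilde{y}_i - \sum_{j=i+1}^{M_T} R_{i,j} s_j$; $M_T 2^Q$ enumerated-nodes flags; column-wise zig-zag enumeration; $M_C$ and $M_A$ units ($M_A = 0$); mapper M, demapper D, quantizer Q; pruning criteria; $\{\Lambda^{\mathrm{MAP}}_{i,b}\}$, $\lambda^{\mathrm{MAP}}$, $\{x^{\mathrm{MAP}}_{i,b}\}$; output $\{L^E_{i,b}\}$; min units.]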
B. Soft-Input Extensions
To extend the base architecture presented in Section IV-A, the main additions are extra units for the a priori-based enumeration, along with slight changes in the column-wise zig-zag implementation. These extensions correspond to the dark gray units in Figure 2.
1) Enumerated-nodes flags:
Both the channel- and a priori-based enumeration units have to skip nodes that have already been enumerated, because the local enumeration orders for $M_C$ and $M_A$ differ from the global enumeration order. Therefore, both units need the list of enumerated nodes to guarantee that each node is enumerated only once. This flag vector of $2^Q$ bits per antenna is maintained in unit ⑦.
2) Modified column-wise zig-zag enumeration:
Skipping an arbitrary number of nodes implies modifications to the column-wise zig-zag implementation (⑥). Compared with the base architecture, the new column-enumeration unit no longer keeps internal zig-zag states. Instead, each column enumeration performs a minimum search over the linear distances between the quantized imaginary part $\mathcal{Q}(\mathrm{Im}\{\tilde{y}_i - \sum_{j=i+1}^{M_T} R_{i,j} s_j\})$ and all rows $\{\mathrm{Im}\{s_i\} \mid s_i \in \mathcal{O}\}$, masked by the enumerated-nodes flags. The hardware complexity increases only moderately, because the distance computations are the same for all columns and operate on words of only $Q/2 + 1$ bits.
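The masked minimum search within one column can be illustrated with a small software sketch. It is not the hardware implementation: the function name, the representation of rows as integer indices, and the tie-breaking rule (lowest row index wins) are assumptions made for this example.

```python
def next_row_in_column(target_row, row_flags):
    """Pick the not-yet-enumerated row closest to the quantized target.

    target_row: integer row index obtained by quantizing the imaginary
                part (hypothetical representation).
    row_flags:  row_flags[r] is True if the node in row r of this column
                has already been enumerated.
    Returns the selected row index, or None if the column is exhausted.
    """
    best_row, best_dist = None, None
    for row, used in enumerate(row_flags):
        if used:
            continue  # masked by the enumerated-nodes flag
        dist = abs(row - target_row)  # linear distance along the column
        if best_dist is None or dist < best_dist:
            best_row, best_dist = row, dist
    return best_row


# Row 2 was already enumerated, so one of its neighbors is selected next.
row = next_row_in_column(2, [False, False, True, False])
```

Because the search is stateless, any number of nodes can be skipped between two calls, which is the property the modified unit needs.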
3 
and an order defined by
Q } is the lack of relations among a priori LLRs. Thus, the only known solution is the full computation and sorting of {M A } i .
First, the computation of $\{M_A\}_i$ requires $2^Q - Q - 1$ additions per antenna and received vector. Due to the ONPC principle and the structure of (9), the number of hardware adders can be reduced by resource sharing. The first enumeration step always results in d

The second issue is sorting $\{M_A\}_i$. Since latency is typically a serious issue for run-time constrained depth-first SD, an approach has been chosen that does not add latency for the sorting of $\{M_A\}_i$. The ONPC principle allows a minimum search for $M_{A,\min}$ over the set $\{M_A\}_i$ for the enumeration of the current antenna $i$, masked by the enumerated-nodes flags. The resulting binary tree of compare-select (CS) units would dominate the critical path already for 16 QAM.
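The addition count of $2^Q - Q - 1$ can be checked with a short sketch. It assumes that the a priori metric of a symbol is the sum of the magnitudes $|L^A_{i,b}|$ over those bits whose value disagrees with the sign of the corresponding LLR, which is the form of the approximation in [5] as we understand it. The function name and the bit-mask indexing of mismatch patterns are hypothetical.

```python
def a_priori_metrics(abs_llrs):
    """Compute the metric of every bit-mismatch pattern of one antenna.

    abs_llrs[b]: penalty |L^A_{i,b}| paid when bit b disagrees with the
                 sign of its a priori LLR (assumed metric structure).
    Returns (metrics, additions), where metrics[mask] is the summed
    penalty over the bits set in mask.
    """
    q = len(abs_llrs)
    metrics = [0.0] * (1 << q)
    additions = 0
    for mask in range(1, 1 << q):
        low = mask & -mask          # lowest set bit of the pattern
        rest = mask ^ low           # pattern without that bit (< mask)
        b = low.bit_length() - 1
        if rest == 0:
            metrics[mask] = abs_llrs[b]                  # single mismatch: no addition
        else:
            metrics[mask] = metrics[rest] + abs_llrs[b]  # reuse smaller pattern
            additions += 1
    return metrics, additions


metrics, additions = a_priori_metrics([1.0, 2.0, 4.0, 8.0])  # Q = 4 (16-QAM)
```

Each pattern with at least two mismatched bits costs exactly one addition, giving $2^4 - 4 - 1 = 11$ additions for 16-QAM and $2^6 - 6 - 1 = 57$ for 64-QAM.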
However, the properties of equation (9) can be exploited to remove almost all comparators and CS dependencies for the first three CS levels. The principle can be explained easily by considering the removal of the first level: for pairs of {M A (s
). This kind of decision does not need any metric comparison but can be determined by single-bit comparisons of sign bits and enumerated-nodes flags. Selecting the minimum of 4-tuples (first two CS tree levels) differing in only two bits {b {m,n} |x
However, this extra comparison is the same for all 4-tuple sub-trees and does not depend on intermediate results generated in the CS tree. Therefore, the critical path is significantly reduced. The extension to 8-tuples (first three CS tree levels) has a total of only six parallel comparators. Thus, only one CS unit and two 8:1 multiplexers are required for 16 QAM and only seven CS units and eight 8:1 multiplexers for 64 QAM. Compared with a full CS tree, the comparator savings are 53 % in total and 50 % in the critical path for 16 QAM and 79 % in total and 33 % in the critical path for 64 QAM. Extensions to higher orders than 8-tuples are possible but would result in an exponential complexity increase.
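The savings quoted above can be reproduced with simple arithmetic. The sketch below assumes a full binary CS tree with one comparator per CS unit, and counts the six parallel comparators of the 8-tuple stage as a single comparator level in the critical path. Both are modeling assumptions made for this check, and the function name is hypothetical.

```python
def cs_tree_savings(num_symbols, tuple_size=8, parallel_comparators=6):
    """Return (total savings in %, critical-path savings in %)."""
    # Full binary compare-select tree over all symbols.
    full_units = num_symbols - 1
    full_depth = num_symbols.bit_length() - 1          # log2(num_symbols)
    # Reduced tree: tuple winners are combined by a small CS tree.
    groups = num_symbols // tuple_size
    remaining_cs = groups - 1
    reduced_units = parallel_comparators + remaining_cs
    reduced_depth = 1 + (groups.bit_length() - 1)      # 1 parallel level + CS levels
    total = round(100 * (1 - reduced_units / full_units))
    path = round(100 * (1 - reduced_depth / full_depth))
    return total, path


savings_16qam = cs_tree_savings(16)
savings_64qam = cs_tree_savings(64)
```

Under these assumptions, 16 QAM needs 7 instead of 15 comparators over 2 instead of 4 levels, and 64 QAM needs 13 instead of 63 comparators over 4 instead of 6 levels, matching the percentages stated in the text.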
4) Pruning-criteria checks:
In [6], the checks of the pruning criteria of equations (6) and (7) have been simplified to a single pruning-criterion check of equation (7) in order to reduce hardware complexity, at the cost of a slight increase of $N_{\mathrm{en}}$. For the SISO STS SD architecture proposed in this paper, the implementation of two different pruning criteria in unit ③ is mandatory to prevent a further significant increase of $N_{\mathrm{en}}$. In order to avoid extra delays on the critical path, the pruning-criteria checks are not implemented as maximum searches but as pairs of $M_T 2^Q$ fully parallel comparators M
V. ASIC SYNTHESIS RESULTS
The architecture presented in the previous section has been implemented in VHDL, including parameters for word lengths, $M_T$, QAM order and a switch to enable/disable soft-input support. A representative set of parameter combinations has been instantiated by layout-aware gate-level synthesis¹. Since both the soft-output-only base architecture and the SISO architecture follow the ONPC principle, their throughput $\Theta$ can be determined by
with $r$ being the code rate and $\mathrm{E}[N_{\mathrm{en}}]$ being the average $N_{\mathrm{en}}$. The curves for the iterative $\Theta$ and the cumulative $\mathrm{E}[N_{\mathrm{en}}]$ for a 4 × 4 16-QAM MIMO system² achieving a frame error rate (FER) of 1 % are given in Figure 3, including as a reference the cumulative $\mathrm{E}[N_{\mathrm{en}}]$ obtained by SE ordering and floating-point operations. In the 4th iteration, the hybrid-enumeration algorithm introduces an overhead of less than 28 % in terms of $\mathrm{E}[N_{\mathrm{en}}]$. The least-effort throughput in Figure 3 is derived from equation (11) by selecting the minimum cumulative $\mathrm{E}[N_{\mathrm{en}}]$ among all iterations for a specific SNR. The intersections of the cumulative $\mathrm{E}[N_{\mathrm{en}}]$ curves determine the SNR points for changing the number of iterations. In Figure 3 the switching points are marked by ① (1 ⇄ 2 iterations), by ② (2 ⇄ 3 iterations) and by ③ (3 ⇄ 4 iterations).
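As a rough illustration of how the ONPC throughput follows from the average number of examined nodes, the sketch below assumes the relation $\Theta = r \, Q \, M_T \, f_{\mathrm{clk}} / \mathrm{E}[N_{\mathrm{en}}]$: one node is examined per clock cycle, each symbol vector carries $Q M_T$ coded bits, and the code rate $r$ converts coded bits to information bits. This form is an assumption made for illustration; the function name and the value $\mathrm{E}[N_{\mathrm{en}}] = 20$ are hypothetical.

```python
def onpc_throughput(code_rate, q_bits, m_t, f_clk_hz, avg_examined_nodes):
    """Information throughput in bit/s under the assumed ONPC relation.

    One examined node per clock cycle means a symbol vector takes
    avg_examined_nodes cycles on average and carries
    code_rate * q_bits * m_t information bits.
    """
    bits_per_vector = code_rate * q_bits * m_t
    vectors_per_second = f_clk_hz / avg_examined_nodes
    return bits_per_vector * vectors_per_second


# 4x4 16-QAM, rate-1/2 code, 250 MHz clock; E[N_en] = 20 is a made-up value.
theta = onpc_throughput(0.5, 4, 4, 250e6, 20)
```

With these example numbers the sketch yields 100 Mbit/s. For iterative detection, the cumulative $\mathrm{E}[N_{\mathrm{en}}]$ over all iterations would take the place of `avg_examined_nodes`.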
Area and delay of this architecture are quite sensitive to the fixed-point word lengths. Therefore, the word lengths have been carefully selected to make the FER-performance loss negligible with respect to floating-point operation³. Figure 4 shows the synthesis results for representative parameter sets. The results for the soft-output-only case are comparable to the implementation published in [6]. Since the two base architectures are similar, they are close in terms of area. The timing differs, mainly for two reasons. First, Figure 4 shows pre-layout synthesis results for a 90 nm technology whereas those in [6] 379 MHz to 250 MHz. We can conclude that the additional cost for soft input is affordable given the prospect of working in lower SNR regimes with iterative systems. The proposed architecture scales almost linearly with $M_T$ in terms of area. The critical path degrades by less than 10 % when doubling $M_T$. When increasing the QAM order by a factor of 4 in the soft-input case, the area less than doubles while the frequency degrades by less than 20-25 %, despite the enumeration being significantly affected.

¹ UMC 90 nm standard-performance CMOS library, typical case, Synopsys Design Compiler 2009.06-sp1 in topographical mode.

² Throughout this paper we use a system with an i.i.d. Rayleigh fading channel, perfect channel knowledge and SQRD [10]. The BICM transmission is set up with a convolutional channel code (rate 1/2, generator polynomials [133o, 171o], constraint length 7) decoded by a max-log BCJR channel decoder with perfect termination knowledge and an S-random interleaver corresponding to 512 information bits. The SNR is defined as $\mathrm{SNR} = M_T E_s / N_0$, with $E_s = \mathrm{E}[|s|^2]$, $s \in \mathcal{O}$. $P[s_i]$ is approximated by equation (9). The VLSI architecture internally operates on normalized metrics $M_{\mathrm{norm}} = N_0 M$ to avoid division by $N_0$; normalized clipping levels are given by $N_0 L^E_{\max}$. Area is measured in gate equivalents (GEs). One GE corresponds to the area of a two-input drive-one NAND gate.
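The clock frequencies of 379 MHz and 250 MHz quoted in this section are consistent with the 34 % frequency degradation stated in the abstract, assuming they refer to the soft-output-only and the SISO configuration of the 4 × 4 16-QAM system, respectively:

```latex
\frac{379\,\mathrm{MHz} - 250\,\mathrm{MHz}}{379\,\mathrm{MHz}} = \frac{129}{379} \approx 0.34 = 34\,\%.
```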
VI. CONCLUSION
To the best of our knowledge, we have introduced the first SISO STS SD architecture, enabling iterative STS SD-based receivers. The parametrized architecture offers very good scalability over $M_T$ and the QAM order. The approximate hybrid-enumeration method enables the implementation of iterative STS-based MIMO receivers, although high data-rate communication systems may require multiple parallel SD instances to meet the throughput constraints. We believe that the algorithms and hardware-design principles presented in this paper are suitable for most kinds of SD architectures. Our future development will focus on further enhancements of the architecture, based for instance on the ideas proposed in [6].
VII. ACKNOWLEDGEMENT
