Abstract-Sphere decoding (SD) solves high-dimensional MIMO maximum likelihood detection problems with significantly lower complexity than other methods. The SD algorithm has, however, mostly been analyzed with DSP implementations in mind. We show that VLSI implementations call for new performance metrics, analyze the resulting implementation tradeoffs for the decoding of complex signal constellations, and develop design guidelines and a generic architecture. When the $\ell_\infty$-norm is used for the sphere constraint instead of the $\ell_2$-norm, significant reductions in circuit complexity and improvements in tree pruning efficiency are possible at a minimal performance penalty. As a proof of concept, a high-performance ASIC implementation is presented.
I. INTRODUCTION
Many problems in mobile communications can be described with a simple linear multiple-input multiple-output (MIMO) model. Examples include multi-antenna systems and multiuser detection in CDMA. The corresponding complex-valued input-output relation is

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} \qquad (1)$$

where $\mathbf{H}$ is an $N \times M$ effective channel matrix and $\mathbf{y}$ is the $N$-dimensional received vector, disturbed by the complex additive white Gaussian noise vector $\mathbf{n}$. The entries of the symbol vector $\mathbf{s} \in \mathcal{O}^M$ are chosen independently from a complex constellation $\mathcal{O}$. Real constellations can be considered as a special case. When $\mathbf{H}$ is known at the receiver, the maximum likelihood (ML) detector is given by

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{O}^M} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 \qquad (2)$$
In fading MIMO channels, ML detection exploits $N$th-order diversity, which is not achieved by linear and successive cancellation receivers. Hence, ML detection is attractive in the high SNR regime. Unfortunately, the complexity of an exhaustive search implementation of (2) is exponential in the transmission rate. For the case where $\mathcal{O}^M$ is a (real) integer lattice $\mathcal{L}^M$, sphere decoding (SD) has been proposed by Pohst [1] as an alternative approach, which has recently been introduced into communications. The algorithm achieves ML performance with an expected complexity that is only polynomial in the rate [2]. Numerous optimizations have been proposed to reduce the implementation complexity of the original SD algorithm on general purpose processors and digital signal processors (DSPs) [3]. However, the VLSI implementation of the algorithm has received only limited attention so far.
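To make the exponential cost of the exhaustive search concrete, the following minimal sketch evaluates the ML criterion (2) for every candidate vector. It is an illustration only and not part of the proposed architecture; the function name `ml_exhaustive`, the QPSK alphabet, and the 2 × 2 example channel are assumptions chosen for the example.

```python
import itertools
import numpy as np

def ml_exhaustive(H, y, constellation):
    """Brute-force ML detection: test every s in O^M and keep the closest Hs."""
    M = H.shape[1]
    best_s, best_d = None, np.inf
    for cand in itertools.product(constellation, repeat=M):
        s = np.array(cand)
        d = np.linalg.norm(y - H @ s) ** 2   # squared Euclidean distance
        if d < best_d:
            best_s, best_d = s, d
    return best_s, best_d

# Hypothetical noise-free 2 x 2 example with QPSK symbols.
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
H_demo = np.array([[1.0 + 0.5j, 0.3 - 0.2j], [0.1 + 0.4j, 0.9 - 0.3j]])
s_demo = np.array([1 - 1j, -1 + 1j])
s_hat, d_hat = ml_exhaustive(H_demo, H_demo @ s_demo, QPSK)
```

The loop runs over $|\mathcal{O}|^M$ candidates: 16 in this example, but already 65,536 for a 4 × 4 system with 16-QAM.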
Contributions:
In this paper, we describe implementation tradeoffs for high-throughput sphere decoding with complex modulation schemes in VLSI and introduce an efficient architecture. We also present slightly suboptimal numerical simplifications that significantly reduce the circuit complexity and increase the efficiency of the algorithm, while preserving diversity order in fading MIMO channels.
Outline: In Section II, we review the state of the art in sphere decoding and emphasize key concepts. Section III introduces an efficient high-level VLSI architecture for high-throughput SD implementations. In Section IV, implementation options for the realization of a complex SD are considered and compared. In Section V, we propose the square-root sphere criterion and its suboptimal variations that lead to significantly reduced circuit complexity and higher throughput. Section VI finally describes the implementation of a high-performance 4 × 4 16-QAM sphere decoding chip.
Notation: $\|\mathbf{a}\|_p$ denotes the $\ell_p$-norm of the vector $\mathbf{a}$. When the subscript is omitted, we mean the $\ell_2$-norm (the Euclidean norm). $E\{\cdot\}$ stands for the expectation operator.
II. STATE OF THE ART IN SPHERE DECODING
Under the term sphere decoding we subsume the original SD algorithm [1] and all the variations and extensions proposed later [4]-[6]. The algorithm consists of four key concepts that need to be clearly differentiated:
A. Sphere Constraint
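The main idea is to reduce the number of points that need to be considered in the search for the ML solution. The list of candidates is constrained to only those points $\mathbf{H}\mathbf{s}$ that lie inside a sphere with a given radius $C$ around the received point $\mathbf{y}$:

$$d(\mathbf{s}) < C^2 \quad \text{with} \quad d(\mathbf{s}) = \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 \qquad (3)$$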
We refer to equation (3) as the sphere constraint (SC).
B. Tree Pruning
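Efficient checking of the SC only becomes feasible after transforming (2) into an equivalent problem with a triangular channel matrix. Assuming that $N \ge M$, this goal can, for example, be achieved using the QR decomposition of the channel $\mathbf{H} = \mathbf{Q}\mathbf{R}$ with the $M \times M$ upper triangular matrix $\mathbf{R}$ and the $N \times M$ matrix $\mathbf{Q}$ with orthonormal columns. In this case, the SC (3) becomes

$$\hat{d}(\mathbf{s}) < \hat{C}^2 \quad \text{with} \quad \hat{d}(\mathbf{s}) = \|\mathbf{R}\mathbf{s} - \hat{\mathbf{y}}\|^2$$

where $\hat{\mathbf{y}} = \mathbf{Q}^H \mathbf{y}$ is $M$-dimensional and where, in the case $N > M$, the radius $C$ needs to be adjusted to yield the modified radius $\hat{C}$. As a result of the triangular structure of $\mathbf{R}$, the distance $\hat{d}(\mathbf{s})$ can now be computed recursively as follows:

$$T_{M+1}(\mathbf{s}) = 0, \qquad T_i(\mathbf{s}) = T_{i+1}(\mathbf{s}) + \Big|\hat{y}_i - \sum_{j=i}^{M} R_{ij} s_j\Big|^2, \quad i = M, \dots, 1, \qquad \hat{d}(\mathbf{s}) = T_1(\mathbf{s})$$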
We term $T_i(\mathbf{s})$ the partial Euclidean distance (PED) of a symbol vector $\mathbf{s}$ at level $i$. All possible symbols $\mathbf{s} \in \mathcal{O}^M$ can now be considered by tree traversal with the root at $i = M + 1$. Since the PED increases monotonically from level to level, a branch together with all its children can be pruned whenever its PED exceeds $\hat{C}^2$. The remaining branches belong to an admissible set of constellation points that need to be followed further. Ideally, this set should constitute only a small subset of the constellation $\mathcal{O}$. The goal of the SD algorithm is to prune large parts of the tree, such that the complexity of the search for the ML solution is greatly reduced.
One of the main issues with the original Pohst algorithm is the choice of the sphere radius $\hat{C}$. If it is chosen too small, no solution is found; if it is chosen too large, too many nodes need to be considered. If we set $\hat{C}^2 = \hat{d}(\mathbf{s})$ whenever an admissible point $\mathbf{s}$ is found [6], the sphere radius decreases throughout the algorithm, and as a result fewer points need to be considered. This radius updating can be performed in a smart fashion, such that a restart of the algorithm with each radius update can be avoided [3].
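The recursive PED computation can be sketched in a few lines. The sketch assumes that the PED increment at level $i$ is the squared magnitude of $\hat{y}_i - \sum_{j \ge i} R_{ij} s_j$ and uses a square channel ($N = M$), so that no radius adjustment is needed; the function name `partial_distances` and the random test data are illustrative assumptions.

```python
import numpy as np

def partial_distances(R, y_hat, s):
    """Return [T_1, ..., T_M] for the symbol vector s (list index i - 1)."""
    M = R.shape[1]
    T = np.zeros(M + 1)                      # T[M] plays the role of T_{M+1} = 0
    for i in range(M - 1, -1, -1):           # from the top level down to level 1
        e = y_hat[i] - R[i, i:] @ s[i:]      # uses only symbols s_i, ..., s_M
        T[i] = T[i + 1] + abs(e) ** 2
    return T[:M]

# Hypothetical 3 x 3 complex channel with QPSK symbols and mild noise.
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
s = np.array([1 + 1j, -1 + 1j, 1 - 1j])
y = H @ s + 0.1 * (rng.standard_normal(3) + 1j * rng.standard_normal(3))
Q, R = np.linalg.qr(H)
y_hat = Q.conj().T @ y
T = partial_distances(R, y_hat, s)
```

Here `T[0]` equals the full distance $\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2$, and the entries never decrease toward the leaves, which is the property that makes pruning safe.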
C. Admissible Intervals
When the underlying constellation $\mathcal{O}$ is real, e.g., if it constitutes a real lattice, we can easily verify that at a given level $i$, any constellation point $s_i$ that lies between two admissible points is also admissible. As a consequence, the admissible set is actually an interval. Checking whether $s_i$ is in the admissible set therefore only amounts to comparing $s_i$ to the boundaries of that interval.
The latter point often leads to the impression that sphere decoding concepts are only applicable to real lattices. However, it is crucial to realize that only the first two ideas described above are actually prerequisites for the tremendous complexity savings of the SD algorithm over a brute force search. In fact, sphere decoding is applicable to arbitrary sets of constellation points, in particular to complex lattices [7], [8]. In this case, the PED computation becomes

$$T_i(\mathbf{s}) = T_{i+1}(\mathbf{s}) + \Big|\hat{y}_i - \sum_{j=i}^{M} R_{ij} s_j\Big|^2 \qquad (5)$$

with complex-valued $\hat{y}_i$, $R_{ij}$, and $s_j$,
but membership in the admissible set cannot simply be determined using the bounds of an admissible interval. We shall explore solutions to this problem in Section IV.
D. Schnorr-Euchner Enumeration
Without radius updating, the order in which nodes are visited is irrelevant for the pruning of the tree. However, radius updating leads to the greatest complexity reduction if symbols with smaller distance are visited first. Also, in order to find admissible symbols as fast as possible, depth-first traversal of the tree is mandatory. With the Schnorr-Euchner (SE) enumeration [5], on each level nodes with the smallest PED are followed first, leading to a more rapid shrinkage of the sphere radius. Hence the tree is pruned more efficiently. As an additional advantage of the SE enumeration, the initial sphere radius can be set to infinity. In that case, the first admissible point found is always the so-called Babai point or zero-forcing decision-feedback point.
Summarizing, combined radius updating with SE enumeration is highly recommended, whenever applicable. We shall investigate how to perform the enumeration in practical implementations in Section IV.
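The interplay of depth-first traversal, SE ordering, and radius updating can be illustrated with a short software model. This is a minimal sketch, not the hardware algorithm: it realizes the SE order by explicitly sorting the PEDs of all children, starts with an infinite radius, and checks the result against an exhaustive search. The function name `sphere_decode`, the `state` dictionary, and the random 4 × 4 QPSK test case are assumptions made for the example.

```python
import itertools
import numpy as np

def sphere_decode(R, y_hat, constellation):
    """Depth-first SD with SE ordering and radius updating (illustrative sketch)."""
    M = R.shape[1]
    state = {"s": None, "C2": np.inf, "visited": 0}   # C2 = squared radius
    s = np.zeros(M, dtype=complex)

    def descend(i, T_parent):
        # Observation at level i after cancelling the symbols already fixed.
        b = y_hat[i] - np.dot(R[i, i + 1:], s[i + 1:])
        # SE order: children sorted by increasing PED.
        children = sorted(((T_parent + abs(b - R[i, i] * c) ** 2, c)
                           for c in constellation), key=lambda t: t[0])
        for T, c in children:
            if T >= state["C2"]:
                break          # this child and all later siblings violate the SC
            state["visited"] += 1
            s[i] = c
            if i == 0:         # leaf: store the candidate and shrink the sphere
                state["s"], state["C2"] = s.copy(), T
            else:
                descend(i - 1, T)

    descend(M - 1, 0.0)
    return state["s"], state["C2"], state["visited"]

# Hypothetical test case: 4 x 4 channel, QPSK, moderate noise.
rng = np.random.default_rng(1)
QPSK = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
H = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
y = H @ rng.choice(QPSK, size=4) + 0.3 * (rng.standard_normal(4) + 1j * rng.standard_normal(4))
Q, R = np.linalg.qr(H)
s_sd, d_sd, visited = sphere_decode(R, Q.conj().T @ y, QPSK)

# Reference: exhaustive search over all 4**4 candidates.
s_ml = min((np.array(c) for c in itertools.product(QPSK, repeat=4)),
           key=lambda c: np.linalg.norm(y - H @ c))
```

Because the first descent always follows the child with the smallest PED, the first leaf reached is the zero-forcing decision-feedback point, and every later leaf can only tighten the radius.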
III. CONSIDERATIONS FOR VLSI IMPLEMENTATION
Dedicated VLSI architectures differ from implementations on DSPs through their potential for massively parallel processing and the availability of customized operations and operation sequences that can be executed in a single cycle. The potential of an algorithm to exploit these properties is crucial to guarantee an efficient high-throughput implementation.
A. General Architecture
The VLSI architecture of a high-throughput SD application specific integrated circuit (ASIC) should be designed to ensure that the decoder examines the branches of a new node in each cycle and that no node in the tree is ever visited twice. This paradigm guarantees maximum throughput efficiency. It is achieved by partitioning the decoder into two parallel units:
1) The metric computation unit (MCU) starts from $T_{i+1}(\mathbf{s})$ (i.e., the metric of the branch that leads to the current node) and finds the starting point for the SE enumeration along with the PED $T_i(\mathbf{s})$ of the corresponding branch. When the bottom of the tree is reached, the MCU stores the symbol corresponding to the current path and updates the radius. In this case, or if the admissible set is empty, a new node branching off the current path further up in the tree is visited next. If no more valid branches are found, the decoder stops.
2) The metric enumeration unit (MEU) operates solely on nodes that have already been visited by the MCU. It carries out the SE enumeration to find the branch with the smallest PED among those that have not been visited yet. It keeps a list of these admissible children for all nodes on the current path. When the MCU reaches a leaf or a dead end, the MEU can decide immediately where the search should be continued in the next cycle.
Tree traversal is performed depth-first. The described procedure is exemplified in Fig. 1 for M = 3 and BPSK modulation. On the left, the branches that are examined by the MCU on the way down are marked. The MEU follows along the same nodes with one cycle delay and computes the PED of the branch that would follow next in the SE enumeration. After cycle 3, the MCU has found the first complete candidate symbol. In the meantime, the MEU has determined that only the branch to node B still meets the updated SC and has already computed its PED. Therefore, the MCU can proceed immediately to check the branch leading to node C in cycle 4 and finally to the leaf E in cycle 5. Note that no cycles are wasted to slowly climb up the tree step-by-step.
B. Complexity and Performance Metric
For the architecture described above, we introduce suitable metrics that characterize throughput and implementation complexity:
1) Performance: The recursive tree-pruning scheme applied in the SD algorithm suggests that the described processing of a single node in one step (i.e., in one cycle) is the maximum degree of parallelism that can be achieved while fully preserving its complexity advantage over an exhaustive search. Processing multiple subsequent nodes in parallel would instead gradually lead back to the implementation of an exhaustive search decoder [9]. Although this approach opens up further tradeoffs between chip area and throughput, we do not pursue it here. In a one-node-per-cycle architecture, the overall performance of the decoder is governed by two criteria: (a) the average number of visited nodes $E\{K\}$ before the ML solution is found, which determines the number of cycles per symbol, and (b) the cycle time $t_{\mathrm{CLK}}$, which is given by the critical path through the longest chain of consecutive operations in one cycle. The first criterion characterizes the efficiency of the tree pruning and can be considered purely on an algorithmic level. The second criterion is concerned with the efficiency of the hardware implementation. The overall throughput in bits per second is given by

$$\Phi = \frac{M \log_2 |\mathcal{O}|}{E\{K\}\, t_{\mathrm{CLK}}}$$

Note that, as opposed to optimizations that target DSPs, the pure number of operations (FLOPs) is of little significance.
2) Circuit Complexity: The circuit complexity is measured by the area required for the integration of all processing elements and the memory. Like the delay, it varies significantly depending on the type of operation and on the associated word width. While tradeoffs between area and delay are possible, we strive for maximum throughput in this paper.
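The throughput metric is easy to evaluate numerically. The sketch below uses the 4 × 4 16-QAM configuration and the 13.5 ns cycle time quoted in Section VI; the average node count of 8 is a made-up value for illustration, not a measured result, and the function name `throughput_bps` is an assumption.

```python
import math

def throughput_bps(M, constellation_size, avg_nodes, t_clk_s):
    """Phi = M * log2|O| / (E{K} * t_CLK), in bits per second."""
    return M * math.log2(constellation_size) / (avg_nodes * t_clk_s)

# 4 x 4 system with 16-QAM and a 13.5 ns cycle time (values quoted in Section VI).
# E{K} = 8 is a hypothetical value chosen only to show the arithmetic.
phi_example = throughput_bps(M=4, constellation_size=16, avg_nodes=8, t_clk_s=13.5e-9)

# In a one-node-per-cycle decoder at least M nodes are visited to reach a leaf,
# so E{K} >= M gives an upper bound on the throughput.
phi_upper = throughput_bps(M=4, constellation_size=16, avg_nodes=4, t_clk_s=13.5e-9)
print(f"{phi_example / 1e6:.1f} Mbit/s, upper bound {phi_upper / 1e6:.1f} Mbit/s")
```

The example makes the two levers explicit: halving either $E\{K\}$ (an algorithmic matter) or $t_{\mathrm{CLK}}$ (a circuit matter) doubles the throughput.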
IV. SCHNORR-EUCHNER ENUMERATION IN ARBITRARY SETS AND COMPLEX LATTICES
The most critical parts in the design of an SD are finding the admissible set and implementing the SE enumeration. In the complex case, there is no admissible interval. Three approaches to identify the admissible set have been proposed:
1) Real Decomposition:
In the case of rectangular (complex) $Q$-QAM constellations, a common procedure is to decompose the $M$-dimensional complex problem into an equivalent real-valued problem according to

$$\begin{bmatrix} \Re\{\mathbf{y}\} \\ \Im\{\mathbf{y}\} \end{bmatrix} = \begin{bmatrix} \Re\{\mathbf{H}\} & -\Im\{\mathbf{H}\} \\ \Im\{\mathbf{H}\} & \Re\{\mathbf{H}\} \end{bmatrix} \begin{bmatrix} \Re\{\mathbf{s}\} \\ \Im\{\mathbf{s}\} \end{bmatrix} + \begin{bmatrix} \Re\{\mathbf{n}\} \\ \Im\{\mathbf{n}\} \end{bmatrix}$$

The result is a real lattice $\mathcal{L}^{2M}$ of dimension $2M$ with $\sqrt{Q}$ constellation points per dimension. The decoder can now explicitly compute the center point of an admissible interval, from which it proceeds with the enumeration in a zig-zag fashion [5] until a constellation point is outside the admissible interval. This condition can be checked based on explicit computation of the boundaries or by simply checking the SC [3]. However, traversing the resulting deeper tree with fewer branches per node reduces the potential for parallel processing compared to a shallower tree with more branches per node. Since the number of visited nodes is the main performance metric in VLSI implementations, the decomposition of the complex lattice into a real lattice entails a significant performance degradation (cf. Fig. 2). Moreover, on a circuit level, no advantage can be taken of the orthogonality of the real and imaginary parts or of the symmetries in the complex constellations. Additionally, the computation of the center point of the admissible interval for SE enumeration is slow. Consequently, decomposition into a real lattice is not advisable for a high-throughput VLSI implementation.
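For reference, the real decomposition itself is a simple stacking of real and imaginary parts. The sketch below builds the standard real-valued equivalent of a complex linear model and checks it on two 16-QAM symbols; the function name `real_decomposition` and the random channel are assumptions for the example.

```python
import numpy as np

def real_decomposition(H, y):
    """Map the complex N x M model onto a real 2N x 2M model."""
    H_r = np.block([[H.real, -H.imag], [H.imag, H.real]])
    y_r = np.concatenate([y.real, y.imag])
    return H_r, y_r

rng = np.random.default_rng(0)
H = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
s = np.array([3 - 1j, -1 + 3j])           # two 16-QAM symbols
H_r, y_r = real_decomposition(H, H @ s)
s_r = np.concatenate([s.real, s.imag])    # four real symbols from {-3, -1, 1, 3}
```

The search tree for this example has four levels with four branches each instead of two levels with sixteen branches each, which is the deeper, narrower tree discussed above.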
2) Exhaustive Search: To directly determine the admissible set, an exhaustive search over the full constellation O can be performed. Explicit sorting of the PEDs is subsequently used to realize the SE enumeration. As opposed to the first solution, this method allows for arbitrary complex constellations and does not increase the depth of the search tree.
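At first sight, the full search appears to have a very high implementation complexity, as the PEDs need to be computed for all candidate constellation points. However, (5) can easily be decomposed into

$$b_i(\mathbf{s}) = \hat{y}_i - \sum_{j=i+1}^{M} R_{ij} s_j, \qquad T_i(\mathbf{s}) = T_{i+1}(\mathbf{s}) + |b_i(\mathbf{s}) - R_{ii} s_i|^2$$

where $b_i(\mathbf{s})$ does not depend on the candidate $s_i$.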
As a result, most of the costly operations can be shared among all candidate points.
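The sharing of operations can be made explicit in a small sketch. It assumes that the PED increment splits into a part that depends only on the symbols already fixed further up in the tree (called `b` here) and a candidate-specific term; the function name `candidate_peds` and the numerical values are illustrative assumptions.

```python
import numpy as np

def candidate_peds(R, y_hat, s, i, T_parent, constellation):
    """PEDs of all children of the current node at (0-based) level i."""
    # Shared part: computed once per node, independent of the candidate s_i.
    b = y_hat[i] - np.dot(R[i, i + 1:], s[i + 1:])
    # Candidate-specific part: one product and one squared magnitude each.
    return T_parent + np.abs(b - R[i, i] * np.asarray(constellation)) ** 2

R = np.array([[2.0, 0.5 - 0.5j], [0.0, 1.5]], dtype=complex)
y_hat = np.array([1.0 + 2.0j, -1.4 + 1.6j])
QPSK = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
s = np.array([0, -1 + 1j])                # symbol at the upper level already fixed
peds = candidate_peds(R, y_hat, s, 0, 0.02, QPSK)
```

The inner product over the fixed symbols is evaluated once, and for a fixed channel the products of the diagonal entry with the constellation points could also be precomputed.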
The drawback of this approach is that all PEDs on the path down need to be stored to perform the enumeration. Additionally, the decision for the smallest metric in the MCU involves a search over all constellation points, which is slow and leads to a long cycle time.
3) Hybrid Schemes: Depending on the constellation, hybrid approaches between exhaustive search and ordered enumeration may also be possible, as proposed in [7]: Starting from PSK modulation, admissible intervals are defined based on the phase of the constellation points. Subsequently, QAM modulation is described as the union of PSK subsets, within which enumeration is straightforward. SE ordering across subsets is achieved through explicit sorting of the PEDs.
The difficulty in the application of this approach is the computation of the starting points for the PSK enumerations. However, with a specific modulation scheme in mind it can be performed by simple direct comparisons between the real and imaginary parts, and no angles need to be computed, as opposed to [7] . As a result, this approach generally yields the lowest circuit complexity for QAM modulation and requires only a few PEDs to be stored and compared.
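The description of QAM as a union of PSK subsets can be illustrated by grouping the constellation points by magnitude. The sketch below does this for 16-QAM with coordinates in {-3, -1, 1, 3}; the function name `psk_subsets` and the rounding used for grouping are assumptions, and the code does not attempt to model the enumeration logic of [7] or of the ASIC.

```python
import cmath
from collections import defaultdict

def psk_subsets(constellation):
    """Group constellation points by magnitude; each group lies on one circle."""
    rings = defaultdict(list)
    for c in constellation:
        rings[round(abs(c) ** 2, 9)].append(c)
    # Inner ring first; points within a ring sorted by phase.
    return [sorted(points, key=cmath.phase) for _, points in sorted(rings.items())]

QAM16 = [complex(a, b) for a in (-3, -1, 1, 3) for b in (-3, -1, 1, 3)]
subsets = psk_subsets(QAM16)
print([len(ring) for ring in subsets])    # [4, 8, 4]
```

The three rings correspond to the three nested PSK subsets of 16-QAM. Note that the eight points of the middle ring are not equally spaced in phase, so its enumeration start point has to account for that.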
V. THE SQUARE ROOT SPHERE CRITERION
The computation of the PED on each level can be decomposed into the computation of an error term $e_i(\mathbf{s})$ and the recursive update of the PED:

$$e_i(\mathbf{s}) = \hat{y}_i - \sum_{j=i}^{M} R_{ij} s_j, \qquad T_i(\mathbf{s}) = T_{i+1}(\mathbf{s}) + |e_i(\mathbf{s})|^2 \qquad (7)$$

The squaring operation in (7) consumes a large chip area and is slow, limiting the performance of the PED computation. Moreover, controlling the dynamic range of the numbers becomes a more severe issue after squaring. In order to eliminate the square, we have proposed to operate on the square root $\tilde{T}_i(\mathbf{s}) = \sqrt{T_i(\mathbf{s})}$ of the PEDs [8], which yields an equivalent detector. By taking the square root of (7), we obtain

$$\tilde{T}_i(\mathbf{s}) = \sqrt{\tilde{T}_{i+1}(\mathbf{s})^2 + |e_i(\mathbf{s})|^2}$$
For this type of expression, numerous approximations of the form $\sqrt{x^2 + y^2} \approx f(|x|, |y|)$ are available. Four approximations with efficient VLSI implementations are given in Table I. They all lead to new (suboptimal) detectors, which can be interpreted as minimizing another norm for the triangularized problem (instead of the Euclidean or $\ell_2$-norm, which corresponds to the ML solution). In particular, the first approximation leads to the minimization of the $\ell_1$-norm of $\mathbf{R}\mathbf{s} - \hat{\mathbf{y}}$ [10]:

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{O}^M} \|\mathbf{R}\mathbf{s} - \hat{\mathbf{y}}\|_1$$

The second approximation leads to a minimax optimization, or the minimization of the $\ell_\infty$-norm:

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{O}^M} \|\mathbf{R}\mathbf{s} - \hat{\mathbf{y}}\|_\infty$$

The remaining approximations correspond to hybrid norms. Note that only the $\ell_2$-norm exhibits the property that the minimizations on the original and on the triangularized problem yield the same solution.
In the complex case, the same approximations can be used to compute the absolute value of the complex error term $|e_i(\mathbf{s})|$ from its real and imaginary parts. The reduction in circuit complexity is significant. Still, the impact on bit error rate (BER) is small. We use the exact absolute value computation in our simulations.
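The connection between the approximation of the square-root update and the resulting norm can be seen by accumulating a few error magnitudes. The sketch below assumes that the two approximations referred to in the text are the sum and the maximum of the two arguments, which are the choices that yield the 1-norm and the infinity-norm; the function names and the error values are illustrative.

```python
import math

def update_l2(t_prev, e_abs):    # exact square-root recursion
    return math.hypot(t_prev, e_abs)

def update_l1(t_prev, e_abs):    # sqrt(x^2 + y^2) replaced by |x| + |y|
    return t_prev + e_abs

def update_linf(t_prev, e_abs):  # sqrt(x^2 + y^2) replaced by max(|x|, |y|)
    return max(t_prev, e_abs)

def accumulate(update, errors):
    """Run the recursion from the top level M down to level 1."""
    t = 0.0
    for e in reversed(errors):
        t = update(t, abs(e))
    return t

errors = [3.0, 4.0, 12.0]        # hypothetical |e_1|, |e_2|, |e_3|
for name, rule in [("l2", update_l2), ("l1", update_l1), ("linf", update_linf)]:
    print(name, accumulate(rule, errors))
```

For these values the three recursions end at 13, 19, and 12, i.e., at the Euclidean norm, the sum, and the maximum of the error magnitudes. The sum and maximum rules need only an adder or a comparator per level instead of two squarers and a square root.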
1) Impact on Complexity:
Since

$$\|\mathbf{a}\|_\infty \le \|\mathbf{a}\|_2 \le \|\mathbf{a}\|_1$$

the accumulated distance at the bottom of the tree is smallest for the $\ell_\infty$-norm. After the radius update, it is therefore more probable that the PED for a certain node further up in the tree is larger than the new radius, as compared to the $\ell_2$-norm case. Therefore, more branches in the tree will be pruned. The situation is just the opposite for the $\ell_1$-norm, so tree pruning becomes less effective. The impact on complexity is clearly visible in Fig. 3. It is remarkable that employing the $\ell_\infty$-norm almost halves the complexity at low SNR.
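The effect of the norm on the number of visited nodes can be explored with a small software experiment. The sketch below runs the same depth-first search with three different update rules and reports the average node count for a toy 4 × 4 QPSK setup; the setup, the noise level, and all names are assumptions, so the printed averages are not expected to reproduce the figures in this paper.

```python
import math
import numpy as np

UPDATES = {
    "l1": lambda t, e: t + e,
    "l2": math.hypot,
    "linf": max,
}

def count_nodes(R, y_hat, constellation, update):
    """Depth-first SD on square-root PEDs; returns (solution, visited nodes)."""
    M = R.shape[1]
    state = {"radius": math.inf, "nodes": 0, "s": None}
    s = np.zeros(M, dtype=complex)

    def descend(i, t_parent):
        b = y_hat[i] - np.dot(R[i, i + 1:], s[i + 1:])
        # SE order by error magnitude; every update rule is monotone in it.
        for c in sorted(constellation, key=lambda c: abs(b - R[i, i] * c)):
            t = update(t_parent, abs(b - R[i, i] * c))
            if t >= state["radius"]:
                break
            state["nodes"] += 1
            s[i] = c
            if i == 0:
                state["radius"], state["s"] = t, s.copy()
            else:
                descend(i - 1, t)

    descend(M - 1, 0.0)
    return state["s"], state["nodes"]

rng = np.random.default_rng(2)
QPSK = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
totals = {name: 0 for name in UPDATES}
for _ in range(200):
    H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / math.sqrt(2)
    n = 0.5 * (rng.standard_normal(4) + 1j * rng.standard_normal(4))
    Q, R = np.linalg.qr(H)
    y_hat = Q.conj().T @ (H @ rng.choice(QPSK, size=4) + n)
    for name, update in UPDATES.items():
        totals[name] += count_nodes(R, y_hat, QPSK, update)[1]
print({name: total / 200 for name, total in totals.items()})
```

Only the update rule differs between the three runs, so any difference in the averages comes from how quickly the accumulated metric grows relative to the updated radius.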
2) Impact on Bit Error Rate: The approximations of the $\ell_2$-norm warp the sphere that is searched when looking for the ML solution. However, we can show analytically that full diversity is preserved regardless of the particular norm employed. This fact can be verified in Fig. 4. Among the approximations considered, the $\ell_\infty$-norm detector shows the most pronounced constant loss in BER, although the degradation may well be acceptable in many cases. In conclusion, the $\ell_\infty$-norm approximation represents a very attractive approach, as it leads to greatly reduced search complexity as well as reduced chip area at only a minor BER penalty.
VI. ASIC IMPLEMENTATION RESULTS
As a proof of concept, an SD ASIC for a 4 × 4 system with 16-QAM modulation has been realized in a 0.25 µm technology. It is based on a direct implementation of complex SE enumeration using a decomposition into three nested PSK constellations. For the metric computation, the square root sphere algorithm is used in conjunction with the $\ell_\infty$-norm. The critical path starts with the computation of $b_i(\mathbf{s})$, followed by the part of the MCU that finds the starting point for the PSK enumeration, and the metric computation. It then continues with the selection of the minimum and into the MEU, adding up to a total delay of 13.5 ns, allowing for a clock of 75 MHz. The active core area of the chip covers only 1 mm$^2$. The implementation is about two times smaller and exceeds the throughput of the K-best lattice decoder in [11] by about a factor of three at SNR = 20 dB. Compared to a previously presented implementation [8], it requires only a third of the area and achieves a 50% higher clock rate. Also, fewer iterations are required for the decoding due to the complexity reduction by the $\ell_\infty$-norm approximation and other minor optimizations. The result is a more than doubled throughput at SNR = 20 dB. The technical specifications and the throughput of the ASIC at different SNRs are given in Fig. 5 together with the layout.
VII. CONCLUSIONS
We have presented a VLSI architecture for high-performance sphere decoding in complex lattices. Implementation techniques to realize Schnorr-Euchner enumeration in complex sets are described and compared. To reduce circuit complexity and cycle time, approximations to the ML criterion are proposed, which preserve diversity order. In particular, an $\ell_\infty$-norm approximation is shown to additionally improve the tree pruning process in the SD significantly with only a small SNR penalty. The feasibility was shown with an actual ASIC implementation that, to the best of our knowledge, exceeds the throughput of all other presented VLSI realizations.
