Abstract: Sphere decoding (SD) is a promising means for implementing high-performance data detection in multiple-input multiple-output (MIMO) wireless communication systems. In this paper, we focus on the register-transfer-level implementation of SD with minimum area-delay product for application in wideband MIMO communication systems, such as IEEE 802.11n, where multiple SD cores need to be instantiated. The basic architectural considerations and the proposed optimizations are explained based on hard-output SD, but are also applicable to soft-output SD. Corresponding VLSI implementation results (for both hard-output and soft-output SD) show an improvement in the area-delay product by almost 50 % compared to that of other SD implementations reported in the literature.
I. INTRODUCTION
The ability to increase throughput and range without requiring more bandwidth or transmit power renders multiple-input multiple-output (MIMO) communication the key technology for wideband communication standards [1]. The MIMO gains come, however, at the cost of (often significant) complexity required for data detection. Maximum-likelihood (ML) detection provides excellent error-rate performance, but a straightforward implementation requires exhaustively testing all possible transmit symbols. For high spectral efficiencies, the number of candidate symbols grows exponentially in the number of transmit antennas, which renders this approach prohibitive even for practical data rates.
The sphere decoding (SD) algorithm [2] is one of the most promising methods for ML detection in MIMO systems, since its average complexity is far below that of an exhaustive search. The basic idea behind SD is to transform MIMO detection into a weighted tree-search problem, which is then solved efficiently by a branch-and-bound procedure. The main drawback of this approach is that the decoding effort for SD is essentially determined by the number of nodes to be examined in that tree for each received symbol. For most VLSI implementations of SD, the number of visited nodes corresponds to the number of clock cycles required for each symbol [3]. This number depends on the channel and the noise realization. In the worst case, all nodes in the tree must be examined, corresponding to the (often prohibitive) complexity of an exhaustive search. Since on-chip storage and higher-layer requirements limit the latency that may be incurred to support the processing of symbols for which the decoding effort lies far above the average, the worst-case complexity of SD renders its application in real-world systems difficult. This problem can be mitigated by limiting the maximum decoding effort through early termination of the decoding process, e.g., [4]. Such constraints, however, lead to a tradeoff between the maximum decoding effort and the receiver performance. A universally applicable VLSI architecture for a MIMO detector suitable for wideband MIMO systems must therefore provide a straightforward means to adjust this tradeoff and must minimize the overall silicon area for a given minimum performance requirement.
Outline and Contributions: In this paper, we describe the design and optimization of an SD core that is suitable for wideband MIMO systems. To this end, we first review the SD algorithm and argue that the optimization target for each SD core in a wideband system differs from that usually employed in narrow-band MIMO systems, where a single SD core can handle the throughput requirement (Sec. II). In Sec. III, we describe the register-transfer-level (RTL) architecture for hard-decision SD and propose a low-complexity approximation to the Schnorr-Euchner (SE) enumeration. We also introduce pipeline interleaving and analyze the level of pipelining required to yield the lowest area-delay product. In Sec. IV, we discuss our results and present a comparison to other SD implementations.
For better understanding and due to the limited page count, we focus on hard-output SD throughout the paper. The presented architecture, the proposed enumeration scheme, and pipeline interleaving can, however, also be applied to soft-output SD architectures (e.g., single tree-search (STS) SD [5]). To support this claim, corresponding performance and implementation results are presented.
II. SPHERE DECODING AND WIDEBAND MIMO RECEIVER ARCHITECTURE
In the following, we introduce the MIMO system model, summarize the SD algorithm, and provide an overview of a wideband MIMO receiver architecture for the case where a single SD core is insufficient to meet the throughput requirements associated with a high communication bandwidth.
A. MIMO Detection as a Weighted Tree-Search Problem
System model: We consider a MIMO system employing spatial multiplexing with $M_T$ transmit and $M_R \geq M_T$ receive antennas. The data to be transmitted is mapped to $M_T$-dimensional transmit vectors $\mathbf{s} \in \mathcal{O}^{M_T}$, where $\mathcal{O}$ is the complex-valued scalar constellation. The baseband input-output relation, as seen by the MIMO detector, is given by

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} \tag{1}$$

where $\mathbf{y}$ is the $M_R$-dimensional received vector, $\mathbf{H}$ is the complex-valued $M_R \times M_T$ channel matrix, and $\mathbf{n}$ is an i.i.d. circularly symmetric complex Gaussian noise vector of dimension $M_R$. The ML detection rule for the input-output relation in (1) is given by

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{O}^{M_T}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2. \tag{2}$$

Sphere decoding: SD [6] starts from the QR decomposition of the channel matrix $\mathbf{H} = \mathbf{Q}\mathbf{R}$, with $\mathbf{Q}$ being unitary of dimension $M_R \times M_T$ and $\mathbf{R}$ being $M_T \times M_T$ upper-triangular. This decomposition allows (2) to be rewritten as

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{O}^{M_T}} \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2 \quad \text{with} \quad \tilde{\mathbf{y}} = \mathbf{Q}^H \mathbf{y}. \tag{3}$$

Thanks to the upper-triangularity of $\mathbf{R}$, the minimization problem (3) can be interpreted as a weighted tree-search problem where the nodes of the tree on level $i$ are associated with a partial symbol vector $\mathbf{s}^{(i)} = [s_i \; s_{i+1} \; \cdots \; s_{M_T}]^T$ and with a corresponding partial Euclidean distance (PED) $d_i(\mathbf{s}^{(i)})$. Fig. 1 illustrates the corresponding weighted tree for a MIMO system with $M_T = M_R = 3$ using QPSK modulation. When starting from the root of the tree (at level $i = M_T + 1$ with $d_{M_T+1} = 0$), the PEDs can efficiently be computed in a recursive manner according to

$$d_i(\mathbf{s}^{(i)}) = d_{i+1}(\mathbf{s}^{(i+1)}) + |b_{i+1} - R_{i,i} s_i|^2 \tag{4}$$

using the definition

$$b_{i+1} = \tilde{y}_i - \sum_{j=i+1}^{M_T} R_{i,j} s_j \tag{5}$$

when proceeding from a parent node on level $i + 1$ to one of its children on level $i$. The ML solution corresponds to the path through the tree leading to the leaf associated with the smallest PED. To find this leaf, SD traverses the tree in a depth-first manner. Complexity reduction (compared to an exhaustive search) is achieved by pruning those nodes from the tree for which $d_i(\mathbf{s}^{(i)})$ is larger than a radius $r > 0$. We use a technique known as radius reduction [6], which initializes the radius to $r \leftarrow \infty$ (prior to detection) and performs the radius update $r \leftarrow d_1(\mathbf{s}^{(1)})$ whenever a leaf node $\mathbf{s}^{(1)}$ is reached. In the following, we refer to the condition $d_i(\mathbf{s}^{(i)}) < r$ as the sphere constraint (SC).
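The depth-first search with radius reduction described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the RTL architecture discussed later: the function name `sphere_decode` is hypothetical, levels are indexed from 0 instead of 1, the QR decomposition is assumed to have been carried out beforehand (the inputs are the upper-triangular matrix and the rotated received vector), and the children are ordered by a plain sort rather than by a hardware-friendly enumeration scheme.

```python
def sphere_decode(R, y_tilde, constellation):
    """Depth-first sphere decoder with radius reduction (illustrative sketch).

    R            -- upper-triangular M x M matrix as a list of lists of complex
    y_tilde      -- rotated received vector Q^H y as a list of M complex values
    constellation -- list of complex constellation points
    Returns (best symbol vector, its squared distance, number of visited nodes).
    """
    M = len(R)
    state = {"s": None, "r": float("inf"), "visited": 0}
    s = [0j] * M  # partial symbol vector, filled from the last entry upward

    def search(i, d_parent):
        # Received value with the interference of the already-fixed symbols removed
        b = y_tilde[i] - sum(R[i][j] * s[j] for j in range(i + 1, M))
        # Children in ascending order of their PED increment
        children = sorted(constellation, key=lambda c: abs(b - R[i][i] * c))
        for c in children:
            d = d_parent + abs(b - R[i][i] * c) ** 2  # PED recursion
            if d >= state["r"]:
                break  # sphere constraint violated; remaining siblings are worse
            state["visited"] += 1
            s[i] = c
            if i == 0:
                # Leaf reached: store the candidate and shrink the radius
                state["s"], state["r"] = list(s), d
            else:
                search(i - 1, d)

    search(M - 1, 0.0)
    return state["s"], state["r"], state["visited"]


if __name__ == "__main__":
    qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
    R = [[2 + 0j, 0.5 - 0.3j], [0j, 1.5 + 0j]]
    y_tilde = [2.4 + 1.1j, -1.6 - 1.4j]
    print(sphere_decode(R, y_tilde, qpsk))
```

The returned node count corresponds to the quantity that, per the discussion in Section I, determines the number of clock cycles in most VLSI implementations.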
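[Fig. 2: High-level architecture of the wideband MIMO receiver. Recoverable block labels: channel estimation and QR decomposition, sphere decoder, re-order buffer.]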
B. Wideband MIMO Receiver Architecture
In wideband MIMO systems, such as IEEE 802.11n, a single SD core is usually insufficient to support both the bandwidth and the (error-rate) performance requirements, even for advanced process technologies. Hence, multiple SD-cores are necessary to meet the associated throughput and performance requirements.
Architecture overview: The high-level system architecture of a wideband MIMO receiver based on SD is illustrated in Fig. 2. The data flow starts with the OFDM demodulation. During a training phase, received training symbols are delivered to a MIMO preprocessing unit. This unit estimates the channel matrices $\mathbf{H}$ and performs the necessary pre-computations on $\mathbf{H}$ (i.e., the QR decomposition). During the data phase, the demodulation unit and the MIMO preprocessing unit forward the received vectors and the results of the pre-computation of the corresponding channel matrices to the MIMO detector at a constant arrival rate, which is essentially given by the communication bandwidth of the system. In the MIMO detector, the information required to decode a symbol is first queued in a FIFO. A scheduler reads the entries of the FIFO and forwards them to the next idle SD core together with a runtime constraint (i.e., a constraint on the number of nodes that are allowed to be examined by SD). When the FIFO fills up, the runtime constraints are reduced to ensure that no data is lost. Note that this reduction degrades the quality of the detection. The outputs from the $N$ SD cores are collected and reordered, since the variable runtime may cause decoded symbols to arrive out of order. The reordered symbol estimates are then forwarded to the channel-decoding block.
Implications on SD core optimization: With the architecture described above, the average decoding effort, i.e., the number of visited nodes that can be allocated for decoding each symbol, is determined by

$$\Phi = \frac{N}{B \, T_c}$$

where $B$ denotes the bandwidth of the system (i.e., the arrival rate of the symbols to be decoded), $T_c$ is the clock period of an SD core (assuming one node in the tree is checked in each cycle), and $N$ is the number of SD instances. At the system level, the performance/complexity tradeoff can now be adjusted by the choice of $N$. The resulting area of such a system corresponds to $A_{tot} = N A_{SD}$, where $A_{SD}$ denotes the silicon area of a single SD core. For large $N$, the overall silicon area for a guaranteed number of visited nodes $\Phi$ that can be used for decoding received symbols is given by

$$A_{tot} = \Phi \, B \, A_{SD} T_c = \Phi \, B \, \rho_{SD} \quad \text{with} \quad \rho_{SD} = A_{SD} T_c. \tag{6}$$

From (6), it follows that if multiple SD cores are necessary to meet the performance requirements of a wideband MIMO system, the focus for the optimization of the SD core shifts from minimizing the area or maximizing the throughput to minimizing the corresponding area-delay (AT-)product $\rho_{SD}$.
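The relation between decoding effort, core count, and total area can be made concrete with a short numerical sketch. The function names and all numbers below (arrival rate, clock period, core area) are hypothetical and chosen only for illustration; they are not taken from the implementation results of this paper.

```python
import math


def cores_for_effort(phi, arrival_rate, t_clk):
    """Smallest number of SD cores N such that N / (B * T_c) >= phi."""
    return math.ceil(phi * arrival_rate * t_clk)


def total_area(phi, arrival_rate, a_sd, t_clk):
    """Large-N approximation: A_tot = phi * B * rho_SD with rho_SD = A_SD * T_c."""
    rho_sd = a_sd * t_clk  # area-delay product of a single core
    return phi * arrival_rate * rho_sd


if __name__ == "__main__":
    # Hypothetical example: 45 nodes per symbol, 20e6 symbols/s, 4 ns clock, 50 kGE core
    print(cores_for_effort(45, 20e6, 4e-9))
    print(total_area(45, 20e6, 50e3, 4e-9))
    # Halving the clock period at equal core area halves the required total area
    print(total_area(45, 20e6, 50e3, 2e-9))
```

The sketch shows why only the product of area and clock period matters at the system level: a core that is twice as large but twice as fast requires the same total area for a given decoding effort.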
III. VLSI ARCHITECTURE OF HARD-OUTPUT SD
On the first level of hierarchy, the proposed SD architecture is similar to the one proposed in [3]. In the following, we summarize this architecture and describe a number of optimizations that result in an improved AT-product compared to previously reported SD implementations.
A. High-level Architecture
Fig. 3 shows the high-level block diagram of the proposed SD circuit. The design comprises a metric computation unit (MCU), a metric enumeration unit (MEU), an SC check unit, a level-select multiplexer, and a cache.
The MCU is responsible for the forward iteration of the depth-first tree traversal. In the implementation [3], this forward iteration includes the sequential evaluation of (5) and the computation of the PED in (4). In the present circuit (cf. Fig. 4), a slicer unit performs a decision on the nearest constellation point (CP) and the MCU computes $b_i$ (instead of $b_{i+1}$) in parallel to the PED of level $i + 1$ as proposed in [7]. The resulting $b_i$ is then used in the next iteration (provided that the SC is met); this optimization reduces the critical path without the need for additional hardware.
The MEU operates in parallel to the MCU. While the MCU is processing a node on layer $i$, the MEU selects the next-best CP on layer $i + 1$ according to an enumeration scheme and computes its PED. Hence, once the SD algorithm needs to move upward in the tree, the MCU can directly start the next forward iteration, as all required intermediate results have already been computed by the MEU. The RTL architecture of the MEU (cf. Fig. 4) is similar to that of the MCU. However, the slicer unit that determines the closest CP is replaced by an enumeration unit that determines which CP should be considered next on layer $i + 1$.
The cache stores intermediate results for each level computed by the MEU and the MCU. The SC check is carried out immediately after the computation of the new PEDs. MEU, MCU, level cache, and the result of the SC check decide on which layer the SD algorithm proceeds next. If a leaf that fulfills the SC is found, the radius is updated. In this case an additional clock cycle is necessary, as the PEDs in the level cache need to be checked against the new radius.
B. Enumeration Strategy
The enumeration strategy (implemented by the enumeration unit in the MEU) defines the order in which the children of a node are visited. Radius reduction (cf. Section II-A) is most efficient in combination with the Schnorr-Euchner (SE) enumeration [8] , which visits the children of a node in ascending order of their PEDs. An important advantage of this enumeration strategy is that leaves that are more likely to lead to the ML solution are found early, which expedites the pruning of the tree. Moreover, enumeration of the children of a node can terminate as soon as the first child violates the SC.
Implementation of Schnorr-Euchner enumeration: For each visited node, SE enumeration comprises two types of operations. The first operation is to initialize the enumeration of the children by identifying the child associated with the smallest PED. This task can easily be accomplished by comparing $b_{i+1}$ in (5) to a number of decision boundaries, i.e., by performing a slicing operation in the MCU of Fig. 3. The second type of operation is to enumerate the remaining children in ascending order of their PEDs, which is a nontrivial task for complex-valued constellations. In order to minimize the AT-product $\rho_{SD}$ of the SD core, an efficient implementation of this operation is of paramount importance.
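The slicing operation that initializes the enumeration can be illustrated as follows. The sketch assumes an unnormalized square QAM constellation on the odd-integer grid and an input that has already been divided by the corresponding diagonal entry of the triangular matrix; the function name `slice_qam` is hypothetical. A hardware slicer may instead compare against scaled decision boundaries to avoid the division.

```python
def slice_qam(z, levels_per_dim):
    """Return the square-QAM point on the odd-integer grid closest to z.

    levels_per_dim -- number of amplitude levels per dimension (8 for 64-QAM)
    """
    max_level = levels_per_dim - 1  # outermost level, e.g. 7 for 64-QAM

    def slice_axis(x):
        # Nearest odd integer, clipped to the constellation boundary
        k = 2 * round((x - 1) / 2) + 1
        return max(-max_level, min(max_level, k))

    return complex(slice_axis(z.real), slice_axis(z.imag))


if __name__ == "__main__":
    print(slice_qam(2.3 - 3.9j, 8))   # interior point
    print(slice_qam(9.2 + 0.1j, 8))   # real part is clipped to the boundary
```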
Exhaustive enumeration: Exhaustive enumeration is a straightforward (but rather inefficient) solution to perform SE enumeration [3] . The idea is to first compute the PEDs of all children of a node. During enumeration, a min-search (limited to the subset of children that have not yet been visited) identifies the next child. The main drawbacks of this solution are i) the area requirement to compute the PEDs of all children of a node, ii) the need to store them in the cache, and iii) the fact that a min-search is costly in terms of area and timing, especially for higher order constellations.
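As a software analogue of exhaustive enumeration, the following sketch first computes and caches the PED increments of all children and then repeatedly performs a min-search over the children not yet visited. The function name and the 16-QAM example are assumptions made for illustration; the sketch mirrors the two steps described above but says nothing about their hardware cost.

```python
def exhaustive_se_enumeration(b, r_ii, constellation):
    """Yield (symbol, PED increment) pairs in ascending order of the increment."""
    # Step 1: compute and cache the PED increments of all children
    cache = {c: abs(b - r_ii * c) ** 2 for c in constellation}
    # Step 2: repeated min-search over the children that have not been visited
    while cache:
        nxt = min(cache, key=cache.get)
        yield nxt, cache.pop(nxt)


if __name__ == "__main__":
    qam16 = [complex(re, im) for re in (-3, -1, 1, 3) for im in (-3, -1, 1, 3)]
    for symbol, inc in exhaustive_se_enumeration(0.9 - 2.2j, 1.5, qam16):
        print(symbol, round(inc, 3))
```

Both the cache size and the width of the min-search grow with the constellation size, which reflects drawbacks i) to iii) listed above.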
Subset enumeration: More elaborate solutions for SE enumeration were presented in [3], [6], and [9]. The main idea of these approaches is to divide the complex-valued (two-dimensional) constellation into one-dimensional subsets, so that only one PED per subset needs to be computed and stored, which also reduces the complexity of the min-search. Unfortunately, the number of required subsets becomes large for higher-order modulation schemes, which has a considerable impact on the circuit area and timing of implementations supporting 64-QAM.
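The subset idea can be sketched with one possible choice of one-dimensional subsets, namely the rows of a square QAM constellation; the cited works may partition the constellation differently. Each row keeps a single candidate, obtained by zig-zag enumeration along the real axis, and a min-search (here a heap) selects among the row candidates. The function names are hypothetical, and the input is assumed to be normalized to the odd-integer grid.

```python
import heapq


def zigzag(x, levels):
    """Yield the sorted 1-D levels in ascending distance from x."""
    k = min(range(len(levels)), key=lambda n: abs(levels[n] - x))
    yield levels[k]
    lo, hi = k - 1, k + 1
    while lo >= 0 or hi < len(levels):
        # Take the lower neighbor if the upper side is exhausted or farther away
        take_lo = hi >= len(levels) or (
            lo >= 0 and abs(levels[lo] - x) <= abs(levels[hi] - x)
        )
        if take_lo:
            yield levels[lo]
            lo -= 1
        else:
            yield levels[hi]
            hi += 1


def subset_se_enumeration(z, levels):
    """Yield (symbol, squared distance) in exact SE order using row subsets."""
    heap = []
    for n, im in enumerate(levels):
        gen = zigzag(z.real, levels)
        re = next(gen)
        # One stored candidate (and one distance) per row
        heap.append((abs(complex(re, im) - z) ** 2, n, re, im, gen))
    heapq.heapify(heap)
    while heap:
        dist, n, re, im, gen = heapq.heappop(heap)  # min-search over the rows
        yield complex(re, im), dist
        nxt = next(gen, None)
        if nxt is not None:
            heapq.heappush(heap, (abs(complex(nxt, im) - z) ** 2, n, nxt, im, gen))


if __name__ == "__main__":
    levels = [-7, -5, -3, -1, 1, 3, 5, 7]
    first_five = [c for c, _ in subset_se_enumeration(0.37 - 2.81j, levels)][:5]
    print(first_five)
```

For 64-QAM, this variant already requires eight stored candidates and an eight-input min-search, which illustrates the scaling issue mentioned above.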
C. Approximate SE Enumeration
The goal of considering approximations to SE enumeration is to perform the enumeration without the need for computing, caching, and comparing PEDs for multiple candidate CPs on the same level. Such approximations, based on geometrical considerations, were first proposed in [10] and [11]. The basic idea is to store predefined enumeration sequences in one or multiple look-up tables (LUTs). A fixed sequence is chosen based on several geometric rules that analyze the position of the received point $b_{i+1}$ relative to the closest CP. The accuracy of these techniques can be adjusted by the number and complexity of the associated selection criteria together with the number of predefined LUTs. The major drawback of this approach is the rather poor scaling behavior of the size of the LUTs required for higher-order modulation schemes.
Ordered $\ell^{\tilde{\infty}}$-Norm Enumeration: In the following, we describe an approximation to SE enumeration that can be implemented efficiently in hardware without the need for LUTs and therefore scales well to higher-order constellations (i.e., constellations including and beyond 64-QAM).
Inspired by the $\ell^{\tilde{\infty}}$-norm SD algorithm [3], [12], we define the $\ell^{\tilde{\infty}}$-norm of a vector $\mathbf{x}$ according to $\|\mathbf{x}\|_{\tilde{\infty}} = \max\{|\Re(\mathbf{x})|, |\Im(\mathbf{x})|\}$, where $\Re(\mathbf{x})$ and $\Im(\mathbf{x})$ denote the real and imaginary part of the entries of $\mathbf{x}$, respectively. The starting point for the enumeration is trivially determined by the closest CP (in Euclidean distance). The remaining CPs, however, are enumerated according to their $\ell^{\tilde{\infty}}$-norm distance. (We use the $\ell^{\tilde{\infty}}$-norm only for enumeration, whereas the algorithm in [3], [12] also uses it for distance computations.)
To this end, the area around the closest CP is first subdivided into eight sectors as illustrated in the lower right corner of Fig. 5. The sector containing $b_{i+1}$ is identified with simple geometric rules to define the second CP in the enumeration and the direction for the ordered $\ell^{\tilde{\infty}}$-norm enumeration. CPs with identical $\ell^{\tilde{\infty}}$-norm form one-dimensional subsets. All nodes within the same subset are processed before the algorithm selects the next subset. In the example provided in Fig. 5, the processing order of the one-dimensional subsets is illustrated by the leading number attached to each CP. Within each subset, zig-zag enumeration is applied around the CP closest to $b_{i+1}$; this is illustrated by the trailing number in Fig. 5. The members of each subset are returned in SE order and subsets are enumerated in order of increasing $\ell^{\tilde{\infty}}$-norm.
RTL architecture: For the RTL implementation, the above-described enumeration algorithm can be split into two basic tasks: i) tracking of the position, size, and orientation of the linear subsets, and ii) zig-zag enumeration within the subsets and checking for the boundaries of the finite-size modulation alphabet. Both tasks can be implemented using simple combinational logic, comparators, and three counters. Hence, the required circuit complexity is low.
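The ordering principle can be illustrated with a strongly simplified software stand-in. The sketch below groups the CPs into square rings around the closest CP, assuming that the norm distance is measured from the closest CP in units of the grid spacing, and processes the rings in increasing order. It does not reproduce the sector rules, the split of each ring into linear subsets, or the counter-based RTL; within a ring it simply sorts by Euclidean distance for brevity, which the hardware scheme avoids. All names are hypothetical.

```python
def linf_ring(c, center):
    """Ring index of c around center: max of |real| and |imag| offset in grid steps."""
    d = c - center
    return int(round(max(abs(d.real), abs(d.imag)) / 2))  # grid spacing is 2


def ring_ordered_enumeration(z, levels):
    """Return all square-QAM points ordered ring by ring around the closest CP.

    Simplified stand-in: ties within a ring are broken by Euclidean distance
    to z, not by the sector and zig-zag rules of the hardware scheme.
    """
    points = [complex(re, im) for re in levels for im in levels]
    center = min(points, key=lambda c: abs(c - z))  # closest CP (slicer output)
    return sorted(points, key=lambda c: (linf_ring(c, center), abs(c - z)))


if __name__ == "__main__":
    levels = [-7, -5, -3, -1, 1, 3, 5, 7]
    order = ring_ordered_enumeration(0.37 - 2.81j, levels)
    print(order[:9])  # closest CP followed by the innermost ring
```

Because the ring index rather than the PED determines the order, the resulting sequence can deviate from the exact SE order, which is the effect examined in the following paragraph.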
Impact on error-rate performance and number of visited nodes: Besides reducing the hardware complexity, the approximation to the SE enumeration has an impact on the number of visited nodes and on the (error-rate) performance of the SD algorithm. The reason is that the approximation does not guarantee that the children of a node are always enumerated strictly in ascending order of their PEDs (only the first three CPs always correspond to the first three CPs obtained by SE enumeration). Hence, numerical simulations are performed to verify that the error-rate implementation loss due to the approximation of the enumeration is low and that the number of visited nodes does not increase substantially. Corresponding results for hard- and soft-output SD are shown in Fig. 6 (the simulation setup is given in footnote 3). It can be seen that the loss in terms of the coded frame-error rate (FER) performance is negligible and that the number of visited nodes with $\ell^{\tilde{\infty}}$-norm enumeration is slightly lower (by approximately 5 %) than with exact SE enumeration.
Footnote 3: We consider coded (rate-2/3 convolutional code, constraint length 7, generator polynomials [133o 171o], and random interleaving across space and frequencies) MIMO-OFDM transmission with $M_R = M_T = 4$, 64-QAM (Gray mapping), and 64 OFDM tones. One frame corresponds to 1536 coded bits. A TGn type C [13] channel model is used. We assume perfect channel state information at the receiver and employ minimum mean-square error sorted QR decomposition (MMSE-SQRD) [14] for SD preprocessing. The SNR is per receive antenna.
D. Pipeline Interleaving
Pipelining cannot directly be applied to SD due to the first-order feedback path present in the architecture. Nevertheless, symbol-wise pipeline interleaving can be used to shorten the critical path. The main idea of this approach is to process multiple (independent) symbol vectors in parallel within the same circuit. This basic idea has already been suggested for SD [15], [16], but neither details on suitable locations of the pipeline registers nor a discussion of the number of pipeline stages yielding the optimal AT-product has been provided. Fig. 3 and Fig. 4 show the location of the pipeline registers (in light grey) in the RTL architecture for three pipeline stages. The location was manually chosen to approximately balance the path delays between the pipeline stages, and register retiming during synthesis was allowed for further optimization. Besides adding the pipeline registers in the datapath, the level cache in Fig. 3 was extended to a ring buffer in which each entry is associated with one of the symbols in the pipeline and corresponds to one instance of the original level cache.
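The scheduling effect of symbol-wise pipeline interleaving can be mimicked with a small behavioral model. The sketch assumes that each pipeline stage holds the context of one independent symbol vector, that every vector examines one node per turn, and that a free slot is refilled from a queue on its next turn; the function name and the node counts are hypothetical, and the model does not capture path delays or the ring-buffer cache.

```python
from collections import deque


def pipeline_interleave(node_counts, stages):
    """Behavioral model of symbol-wise pipeline interleaving.

    node_counts -- number of nodes each symbol vector needs (positive integers)
    stages      -- number of pipeline stages, i.e. vectors processed concurrently
    Returns a list of (cycle, job_id) pairs, one per examined node.
    """
    pending = deque(enumerate(node_counts))
    slots = [None] * stages  # one decoding context per pipeline stage
    trace = []
    cycle = 0
    while pending or any(slots):
        k = cycle % stages  # the context served in this cycle
        if slots[k] is None and pending:
            job_id, nodes = pending.popleft()
            slots[k] = [job_id, nodes]
        if slots[k] is not None:
            slots[k][1] -= 1
            trace.append((cycle, slots[k][0]))
            if slots[k][1] <= 0:
                slots[k] = None  # vector finished; slot is refilled on its next turn
        cycle += 1
    return trace


if __name__ == "__main__":
    for cycle, job in pipeline_interleave([3, 2, 4, 1], stages=3):
        print(cycle, job)
```

In this model, each symbol vector advances only every third cycle, so the feedback loop of a single vector has three clock periods to complete while the datapath is still used in (almost) every cycle.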
IV. IMPLEMENTATION RESULTS AND COMPARISON
A. Results for Hard-Output SD
The AT-diagram in Fig. 7 shows the implementation results of hard-output SD with ordered $\ell^{\tilde{\infty}}$-norm enumeration and pipeline interleaving for different numbers of pipeline stages. The results were obtained by synthesizing the RTL description in VHDL with different timing constraints. The proposed architectures have been implemented with support for multiple modulation schemes (BPSK, QPSK, 16-QAM, and 64-QAM) and for up to four spatial streams.
The architecture with three pipeline stages achieves the best AT-product. The architectures with more than three pipeline stages also come close to AT-optimality, whereas the architectures with fewer pipeline stages are clearly outperformed in terms of hardware efficiency. For comparison, implementation results of previous hard-output SD implementations are also included in Fig. 7 (the results are also summarized in Tbl. I). It can be seen that the proposed unpipelined hard-output SD architecture outperforms previous unpipelined designs by at least 23 % in terms of area and by at least 28 % in terms of clock frequency. Furthermore, the AT-product [kGE/MHz] of the proposed architecture with pipeline interleaving is more than a factor of two better than that of a previously reported implementation [15] with pipeline interleaving.
B. Application to Soft-Output STS-SD
The proposed enumeration scheme and pipeline interleaving can also be applied to soft-output SD. The corresponding architecture is based on the soft-output single tree-search (STS) algorithm proposed in [5]. Fig. 6 demonstrates that, also for STS-SD, the FER performance loss due to the proposed $\ell^{\tilde{\infty}}$-norm enumeration scheme is negligible and the average number of visited nodes is slightly reduced. Implementation results for soft-output STS-SD with the proposed $\ell^{\tilde{\infty}}$-norm enumeration scheme are shown in Tbl. II and compared to previous soft-output detection implementations. The presented implementation is clearly superior in terms of area and clock frequency compared to the soft-output detector shown in [11]. The original implementation of soft-output STS-SD in [5] only supports 16-QAM modulation, which is the main reason for its smaller area in the unpipelined case. For hard-output SD, pipeline interleaving with three pipeline stages proved to be Pareto-optimal. As the additional units required for STS-SD do not influence the critical path, STS-SD was also implemented with three pipeline stages. Tbl. II shows that the AT-product is improved by more than 30 % due to pipeline interleaving.
C. The Case for Multiple SD-Cores
Table notes: (a) One GE corresponds to the area of a two-input drive-one NAND gate. (b) Scaled from 0.25 μm to 0.13 μm by multiplying by 0.25/0.13. (c) $D_{avg}$ denotes the average number of nodes used for block processing [4].
In Section II, we argued that a single SD core is insufficient to meet the bandwidth and error-rate performance requirements of modern wireless communication standards such as IEEE 802.11n, where a throughput of 600 Mbps is required. From Tbl. I, we observe that hard-output SD meets the throughput requirement when early termination and block processing according to [4] are applied. For soft-output STS-SD, the number of visited nodes is significantly increased: from seven for hard-output SD to at least 100 for soft-output STS-SD (see footnote 6). To illustrate the necessity for multiple soft-output STS-SD cores, we hypothetically assume $D_{avg} = 100$ for 64-QAM modulation. The throughput of one STS-SD core is then 92 Mbps. To fulfill the throughput requirement of 802.11n, up to seven STS-SD cores are required.
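The core count quoted above follows from a simple ratio, which the following sketch reproduces. The 600 Mbps requirement and the 92 Mbps per-core throughput are taken from the text; the throughput formula assumes one visited node per clock cycle, and the 400 MHz clock used in the second example is a hypothetical value, not a result of this paper.

```python
import math


def sd_throughput_bps(num_streams, bits_per_symbol, f_clk_hz, d_avg):
    """Throughput of one SD core, assuming one visited node per clock cycle."""
    return num_streams * bits_per_symbol * f_clk_hz / d_avg


def cores_required(target_bps, per_core_bps):
    """Number of parallel cores needed to reach the target throughput."""
    return math.ceil(target_bps / per_core_bps)


if __name__ == "__main__":
    # Values from the text: 600 Mbps required, 92 Mbps per STS-SD core
    print(cores_required(600e6, 92e6))
    # Hypothetical clock: four streams, 64-QAM (6 bits), 400 MHz, 100 nodes per vector
    print(sd_throughput_bps(4, 6, 400e6, 100))
```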
V. CONCLUSION
To meet the throughput and latency requirements of wideband systems (e.g., IEEE 802.11n) with sphere decoding (SD), multiple detection cores need to be instantiated. Therefore, the efficiency, i.e., the area-delay product, of a single SD core needs to be optimized. To this end, two techniques, namely ordered $\ell^{\tilde{\infty}}$-norm enumeration and pipeline interleaving, have been proposed. The enumeration scheme significantly reduces the circuit area and the critical-path delay. Simulations also showed that the performance loss due to the new enumeration scheme is negligible. With pipeline interleaving, multiple independent symbol vectors are processed in parallel and the available hardware resources are better exploited. A design-space exploration with different numbers of pipeline stages revealed that the architecture with three pipeline stages is the most efficient. With these two approaches, the area-delay product is improved by almost 50 % compared to that of other SD implementations. Finally, we showed that both approaches can also be applied to soft-output SD.
Footnote 6: An in-depth evaluation of the number of visited nodes for 64-QAM goes beyond the scope of this paper and involves optimizations of different parameters (e.g., clipping level, run-time constraint, SNR requirement). We expect, based on simulation results, a hundred to a few hundred nodes to be visited. For 16-QAM, the numbers have been presented in [5].
