The fixed-complexity sphere decoder (FSD) is one of the most promising techniques for implementing multiple-input multiple-output (MIMO) detection, offering constant throughput and a regular structure that maps flexibly onto parallel architectures. Reported work on the FSD is mainly based on software-level simulation, and few details have been provided on hardware implementation. In this paper, we present a study based on a four-nodes-per-cycle parallel FSD architecture, with several examples of VLSI implementation in 4 × 4 systems with both 16-QAM and 64-QAM modulation and both real and complex signal models. Implementation aspects and details of the architecture are analyzed in order to provide a variety of performance-complexity trade-offs. We also provide a parallel implementation of a log-likelihood-ratio (LLR) generator with an optimized algorithm, which extends the proposed FSD architecture into a soft-input soft-output (SISO) MIMO detector. To the best of our knowledge, this is the first complete VLSI implementation of an FSD-based SISO MIMO detector. The implementation results show that the proposed SISO FSD architecture is highly efficient and flexible, making it very suitable for real applications.
implementation [15]. The recently proposed soft-output single tree search (STS) SD is also based on the depth-first SESD algorithm. It can be implemented efficiently and achieves excellent throughput at medium to high SNR values, but it is much less efficient in the low-SNR region [6]. Furthermore, it is difficult to map these algorithms to highly parallel architectures because of their inherently sequential search order.
The sub-optimal algorithms, such as the K-best and the FSD, can also be extended to soft-output versions, which guarantee constant throughput at the cost of a certain performance loss. The K-best algorithm keeps a fixed number (i.e., K) of best nodes in each level while traversing the tree. This requires sorting operations, which consume additional hardware resources [9]. The FSD algorithm also achieves constant throughput, but with relatively lower hardware complexity. The most outstanding feature of the FSD is its regular tree traversal path, which enables the design of highly efficient parallel architectures [10].
Reported work on the FSD is mainly based on software-level simulation, and most hardware investigations target FPGA devices, such as the FPGA prototypes with pipelined architectures reported in [16] and [17]. These FPGA prototypes achieve very high throughput, but at a very high cost in occupied hardware resources, which makes them impractical in real applications.
In this paper, we extend a four-nodes-per-cycle parallel FSD architecture, which was implemented in the real signal model as reported in [18], to the complex signal model. We present a panorama of FSD implementations covering different types of modulation and both real and complex signal models, in order to provide a variety of performance-complexity trade-offs for different practical situations. We find that the four-nodes-per-cycle FSD architecture can be implemented efficiently in terms of throughput per area unit, especially in the complex signal model.
The LSD acts only as a list generator in SISO MIMO detection, and an LLR generator is required to iteratively refine the LLR of each codeword bit based on the candidate list and on the feedback information received from the channel decoder [9] [12] [13]. The LLR generator usually has a computational complexity comparable to that of the LSD, but it has not attracted enough attention from researchers. In this paper we also provide a parallel implementation of the LLR generator in order to construct an efficient SISO MIMO detector in combination with the four-nodes-per-cycle FSD architecture.

This paper is organized as follows: Section 2 introduces the system model of sphere decoders; Section 3 details the four-nodes-per-cycle architecture together with several implementation aspects and the parallel implementation of the LLR generator; Section 4 shows the implementation results and compares the overall performance of the FSD implementations and other SD implementations; Section 5 concludes the paper.
System Model and Sphere Decoding Algorithm
The diagram of the iterative MIMO decoding system with $N_t$ transmit antennas and $N_r$ receive antennas is shown in Fig. 1. The source bits are first encoded by a channel encoder, which could be, for instance, a Turbo encoder [19] or an LDPC encoder [20]. Then the coded bit stream is interleaved in order to overcome correlated channel noise. The interleaved bits are mapped into an $N_t$-dimensional transmit signal vector $\mathbf{s} = [s_{N_t-1}, \ldots, s_1, s_0]^T$.
Each symbol of the vector is chosen independently from a complex constellation $\Omega$ with $M$ binary bits per symbol, i.e., $|\Omega| = 2^M$. The transmission rate is denoted as $R = N_t M$ bits per channel use (bpcu). The received vector can be denoted as

$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}$ (1)
where $\mathbf{H}$ is the $N_r \times N_t$ complex channel matrix, assumed to be perfectly known at the receiver through channel estimation, and $\mathbf{n} = [n_{N_r-1}, \ldots, n_1, n_0]^T$ is an $N_r$-dimensional complex additive white Gaussian noise vector with independent and identically distributed (i.i.d.) entries.
In Fig. 1, the SISO MIMO detector is employed to generate the soft information required by the concatenated channel decoder. The SISO MIMO detector could be an LSD or employ other MIMO detection algorithms, such as the STS SD or the minimum mean square error (MMSE) decoder [21]. In this paper we assume that an LSD is employed, which generates a candidate list in order to compute LLRs. Soft information is exchanged between the inner MIMO detector and the outer channel decoder through an interleaver and deinterleaver pair. After a certain number of iterations, chosen according to the required throughput or bit-error-rate (BER) performance, the decoded bit stream is obtained through hard decision [11].
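In the transmit symbol vector constellation $\Omega^{N_t}$ there exists an ML solution that can be expressed as

$\mathbf{s}_{ML} = \arg\min_{\mathbf{s} \in \Omega^{N_t}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2$ (2)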
$\mathbf{s}_{ML}$ can be obtained through exhaustive search in a small MIMO system, such as a 4 × 4 system with QPSK modulation (a total of $2^{2 \times 4} = 256$ possible solutions) [22]. For a large system, however, exhaustive search is impractical because of the very large number of possible solutions to be examined [23]. For example, in a 4 × 4 system with 16-QAM modulation there are $2^{4 \times 4} = 65{,}536$ possible solutions. Sphere decoders are therefore proposed to reduce the computational complexity by formulating the calculation of (2) as a tree visit problem and applying certain pruning criteria to decrease the number of visited nodes [24].
A likelihood metric, generally in the form of a partial Euclidean distance (PED), is evaluated for each visited node in order to prune the tree or determine the traversal path. For hard-output sphere decoders, the transmit vector with the minimum PED is chosen as the final solution [4], whereas for the LSD, a certain number of visited tree leaves with the lowest PED values are sent to an LLR generator to compute the soft output information [10].
Complex Signal Model
Before the decoding stage, preprocessing with QR decomposition is applied as

$\mathbf{H} = \mathbf{Q}\mathbf{R}$ (3)

where $\mathbf{Q}$ is an $N_r \times N_r$ unitary (complex orthogonal) matrix and $\mathbf{R}$ is an $N_r \times N_t$ complex upper triangular matrix with positive diagonal elements. The QR decomposition is of critical importance in many MIMO detection algorithms, both to reduce receiver complexity and to obtain better decoding results [25].
Then the PED for each visited node is given by
where
$|e_i|^2$ is the increment of the PED in the $i$-th level and $\mathbf{y}_{ZF}$ is the zero-forcing solution [4].
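The level-by-level accumulation of the PED can be illustrated with a minimal sketch. Equations (4) to (7) are not reproduced in this text, so the code below assumes a commonly used equivalent form that works on the rotated observation $\mathbf{Q}^H\mathbf{y}$ rather than on $\mathbf{y}_{ZF}$; the function name `ped_path` and all variable names are hypothetical, and only the recursion $d_i = d_{i+1} + |e_i|^2$ is taken from the text.

```python
import numpy as np

def ped_path(R, y_hat, s):
    """Return the PED after each level, walking from the root (last row of R) to level 0."""
    n = len(s)
    d = 0.0
    peds = []
    for i in range(n - 1, -1, -1):
        # b_i: observation with the contribution of the already fixed symbols s_j (j > i) removed
        b_i = y_hat[i] - R[i, i + 1:] @ s[i + 1:]
        e_i = b_i - R[i, i] * s[i]
        d += abs(e_i) ** 2          # d_i = d_{i+1} + |e_i|^2
        peds.append(d)
    return peds

# Hypothetical 4 x 4 example with 16-QAM symbols and mild noise.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
pam_levels = np.array([-3, -1, 1, 3])
s = rng.choice(pam_levels, size=4) + 1j * rng.choice(pam_levels, size=4)
y = H @ s + 0.1 * (rng.normal(size=4) + 1j * rng.normal(size=4))

Q, R = np.linalg.qr(H)              # note: NumPy does not force a positive diagonal in R
y_hat = Q.conj().T @ y
peds = ped_path(R, y_hat, s)
full_distance = np.linalg.norm(y - H @ s) ** 2
```

For a square channel matrix, the PED at the bottom level equals the full Euclidean distance $\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2$, which is why the tree search can replace the exhaustive evaluation of (2).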
Real Signal Model
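As an alternative to the complex-signal processing described above, the SD algorithm can also be performed in a real-valued signal model, by transforming the complex channel matrix into a real-valued one:

$\hat{\mathbf{H}} = \begin{bmatrix} \mathrm{Re}\{\mathbf{H}\} & -\mathrm{Im}\{\mathbf{H}\} \\ \mathrm{Im}\{\mathbf{H}\} & \mathrm{Re}\{\mathbf{H}\} \end{bmatrix}$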
where Re{x} and Im{x} denote the real and imaginary parts of x, respectively.
After the transformation, the $N_r \times N_t$ channel matrix $\mathbf{H}$ becomes the $2N_r \times 2N_t$ matrix $\hat{\mathbf{H}}$. Then the QR decomposition is applied to the real-valued matrix:

$\hat{\mathbf{H}} = \hat{\mathbf{Q}}\hat{\mathbf{R}}$
where $\hat{\mathbf{Q}}$ is a $2N_r \times 2N_r$ real-valued orthogonal matrix and $\hat{\mathbf{R}}$ is a $2N_r \times 2N_t$ real-valued upper triangular matrix with positive diagonal elements.
The extended matrix dimensions imply that the number of tree levels is doubled and the PED calculation in (4) ∼ (7) can be expressed as
For each vector in the transmit constellation, the calculated PED values are the same in the real and complex signal models. However, it should be noted that the choice of signal model can significantly affect the BER performance, the throughput and the hardware complexity of sphere decoders.
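The equivalence of the two signal models can be checked numerically. The following sketch stacks real and imaginary parts in the standard way; the helper names `to_real_model` and `to_real_symbols` are hypothetical, and the stacking order of the received and transmit vectors is an assumption chosen to be consistent with the block form of $\hat{\mathbf{H}}$.

```python
import numpy as np

def to_real_model(H, y):
    """Map a complex channel matrix and received vector to their real-valued counterparts."""
    H_real = np.block([[H.real, -H.imag],
                       [H.imag,  H.real]])
    y_real = np.concatenate([y.real, y.imag])
    return H_real, y_real

def to_real_symbols(s):
    """Stack real parts on top of imaginary parts, matching the column order of H_real."""
    return np.concatenate([s.real, s.imag])

# Hypothetical 4 x 4 example with 16-QAM symbols.
rng = np.random.default_rng(1)
H = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
pam_levels = np.array([-3, -1, 1, 3])
s = rng.choice(pam_levels, size=4) + 1j * rng.choice(pam_levels, size=4)
y = H @ s + 0.1 * (rng.normal(size=4) + 1j * rng.normal(size=4))

H_real, y_real = to_real_model(H, y)
s_real = to_real_symbols(s)

dist_complex = np.linalg.norm(y - H @ s) ** 2
dist_real = np.linalg.norm(y_real - H_real @ s_real) ** 2

# The real-valued tree has twice as many levels, each with a smaller per-level alphabet.
Q_real, R_real = np.linalg.qr(H_real)
```

The two distances agree up to rounding, while the real-valued tree has $2N_t = 8$ levels with 4 branches per node instead of 4 levels with 16 branches per node.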
Depth-first SESD
Before describing the FSD, we first look at the fundamental depth-first SESD. The SESD performs the tree traversal in depth-first style and prunes the tree by comparing the PED of each visited node with the current radius of the sphere, which is set to infinity at the beginning and is updated during the traversal. The tree traversal moves downwards from the top level, as shown in Fig. 2 for the example of a 2 × 2 system with QPSK modulation; in this case, the tree has two levels and each node in the top level has four child nodes. The dark nodes numbered 1 to 6 denote the visiting order. The SESD starts the depth-first tree traversal by first choosing the child node with the minimum PED (node 1) and moving downwards. When the bottom level is reached, the minimum PED of the child nodes is compared with the current radius. If the PED is smaller than the radius, the radius is updated with the value of the PED (node 2) and the other child nodes are discarded. Then the SESD returns to the sibling nodes in the upper level and performs the same procedure. If the PED of a node is smaller than the current radius (node 3), the tree traversal goes downwards (node 4); otherwise the whole branch under the node is pruned (nodes 5 and 6). By applying this tree pruning, the total number of visited nodes is significantly reduced compared with exhaustive search, while still yielding ML performance.
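The traversal described above can be sketched as a short recursive routine. This is a minimal illustration rather than the implementation used in any cited work: the function name `sesd`, the use of the rotated observation $\mathbf{Q}^H\mathbf{y}$, and the way visited nodes are counted are all assumptions.

```python
import itertools
import numpy as np

def sesd(R, y_hat, constellation):
    """Hard-output depth-first search with Schnorr-Euchner ordering and radius pruning."""
    n = R.shape[0]
    best = {"radius": np.inf, "s": None, "visited": 0}
    s = np.zeros(n, dtype=complex)

    def visit(i, d_parent):
        b_i = y_hat[i] - R[i, i + 1:] @ s[i + 1:]
        incr = np.abs(b_i - R[i, i] * constellation) ** 2
        for c in np.argsort(incr):          # children in order of increasing PED
            d = d_parent + incr[c]
            if d >= best["radius"]:
                break                       # all remaining siblings are at least as far: prune
            best["visited"] += 1
            s[i] = constellation[c]
            if i == 0:
                best["radius"] = d          # leaf reached: shrink the sphere
                best["s"] = s.copy()
            else:
                visit(i - 1, d)

    visit(n - 1, 0.0)
    return best["s"], best["radius"], best["visited"]

# Hypothetical 2 x 2 QPSK example, checked against exhaustive search.
rng = np.random.default_rng(2)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])
H = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
s_tx = rng.choice(qpsk, size=2)
y = H @ s_tx + 0.3 * (rng.normal(size=2) + 1j * rng.normal(size=2))

Q, R = np.linalg.qr(H)
s_hat, radius, visited = sesd(R, Q.conj().T @ y, qpsk)
exhaustive_min = min(np.linalg.norm(y - H @ np.array(c)) ** 2
                     for c in itertools.product(qpsk, repeat=2))
```

Because the children of a node are examined in order of increasing PED, the first child that exceeds the radius allows all of its remaining siblings to be discarded at once.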
The soft-output SESD performs a depth-first tree traversal similar to that of the hard-output SESD. The difference is that the radius is not updated until the required number of candidates has been obtained, and sorting operations are necessary to insert new candidates into the list. After the tree traversal, the final candidate list is sent to an LLR generator to compute the soft information.
The SESD needs to reach the bottom tree level immediately in order to update the radius and perform tree pruning, and this depth-first sequential order makes it hard to adopt parallel processing architectures. The resulting throughput is variable and tends to drop significantly in the low-SNR region. Furthermore, the performance becomes much worse when the number of visited nodes is bounded in order to reduce the detection complexity and latency in real applications [15]. To obtain a constant throughput, a constant number of nodes must be visited. Sub-optimal detection methods have been proposed to achieve this result, and the FSD technique is one of the most interesting ones.
FSD Architecture
The FSD algorithm is also based on PED calculation but does not need to update a radius to prune the tree. Both the number and the positions of the nodes to be visited in each level are pre-determined before decoding, depending on the so-called node distribution.
By definition, the processing complexity of the FSD is fixed, which yields a constant throughput. Furthermore, because it is not necessary to reach the bottom level immediately to update a radius, the traversal order is very flexible and can be either depth-first or breadth-first. One of the most outstanding advantages of the FSD is that its deterministic and regular tree traversal path makes it easy to adopt parallel architectures. Several pipelined architectures implemented on FPGA devices have been reported to achieve very high throughput, but they also consume a very large amount of hardware resources [16] [17]. We find, however, that a highly area-efficient implementation of the FSD is possible with the four-nodes-per-cycle parallel architecture.
Four-nodes-per-cycle Architecture
The four-nodes-per-cycle FSD architecture proposed in [18] employs breadth-first tree traversal in order to shorten the critical path. Its major computational tasks include the calculation of $b_i$ and the selection of child nodes, which are handled by the $b_i$ units and the DE units, respectively.

The $b_i$ units are responsible for calculating $b_i$ in (7). When computing $b_i$, both the real and imaginary parts of each symbol are chosen from the small set $\{+3, +1, -1, -3\}$ for 16-QAM modulation, so the multiplications between $R_{i,j}$ and the symbols can be transformed into additions between $\pm 2\mathrm{Re}\{R_{i,j}\}$, $\pm \mathrm{Re}\{R_{i,j}\}$, $\pm 2\mathrm{Im}\{R_{i,j}\}$ and $\pm \mathrm{Im}\{R_{i,j}\}$ (with the additional terms $\pm 4\mathrm{Re}\{R_{i,j}\}$, $\pm 8\mathrm{Re}\{R_{i,j}\}$, $\pm 4\mathrm{Im}\{R_{i,j}\}$ and $\pm 8\mathrm{Im}\{R_{i,j}\}$ for 64-QAM modulation). The values of $\pm 2\mathrm{Re}\{R_{i,j}\}$ and $\pm 2\mathrm{Im}\{R_{i,j}\}$ are easily obtained through a left-shift operation.

In order to shorten the delay path, all these values are fed into a Wallace compression tree of carry-save adders (CSAs), which is followed by a common ripple-carry adder (RCA) [26].
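The replacement of multiplications by shifts and additions can be illustrated in software. The sketch below assumes fixed-point values represented as Python integers; the function names are hypothetical, and the particular decompositions (3 = 2 + 1, 5 = 4 + 1, 7 = 8 - 1) are one possible choice consistent with the shifted terms listed above, not necessarily the one used in the hardware.

```python
def times_odd_level(r, a):
    """Multiply the fixed-point integer r by a in {+-1, +-3, +-5, +-7} using shifts and adds."""
    magnitude = abs(a)
    if magnitude == 1:
        product = r
    elif magnitude == 3:
        product = (r << 1) + r      # 2r + r
    elif magnitude == 5:
        product = (r << 2) + r      # 4r + r
    elif magnitude == 7:
        product = (r << 3) - r      # 8r - r
    else:
        raise ValueError("a must be an odd level with |a| <= 7")
    return product if a > 0 else -product

def complex_times_symbol(r_re, r_im, a, b):
    """Compute (r_re + j r_im) * (a + j b) for odd levels a, b without any multiplier."""
    re = times_odd_level(r_re, a) - times_odd_level(r_im, b)
    im = times_odd_level(r_im, a) + times_odd_level(r_re, b)
    return re, im

# Example: a fixed-point entry (37 - 12j) times the 16-QAM symbol (3 - 1j).
example = complex_times_symbol(37, -12, 3, -1)
```

Only levels up to 3 are needed for 16-QAM, whereas 64-QAM also uses the levels 5 and 7, which is where the terms shifted by two and three bit positions come in.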
The DE units are employed to select the child nodes to be visited. Because all the candidates are derived from a common parent (i.e., with the same $d_{i+1}$), the solution is to compare $|e_i|$ among all the child nodes and choose the node with the minimum $|e_i|$. Since $b_i$ is already calculated in the previous cycle, the task of the DE unit is simply comparing 
Complexity of FSD
The complexity of the LSD is usually expressed in terms of the number of visited nodes. According to [4], performing sphere decoding directly on the complex constellation is more efficient in VLSI implementations. On the contrary, Myllylä et al. show that real-valued algorithms are less complex and feasible for practical LSD implementation [15]. However, a quantitative comparison of the two alternatives, based on fully synthesized hardware and extended to several MIMO systems, is currently not available. We will see later that the complex implementation achieves higher efficiency in terms of throughput per area unit, which is a more meaningful measure than the silicon area alone.
Channel ordering algorithms can be applied to improve the performance. We find that even a simple sorted QR decomposition (SQRD) improves the performance significantly. The SQRD algorithm reorders the detection sequence to minimize the risk of error propagation, by maximizing $|R_{i,i}|$ for $i = N_t - 1, N_t - 2, \ldots, 0$ [29]. Therefore the transmit antennas with the strongest signals are assigned to the levels closer to the root of the tree [25].
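A sorted QR decomposition can be sketched as a modified Gram-Schmidt procedure with greedy column selection. The code below is an illustration under assumptions: it picks, at every step, the remaining column with the smallest residual norm, which is a common way of describing SQRD, but whether it matches the exact ordering rule of [29] is not confirmed by this text. The function name `sorted_qr` is hypothetical.

```python
import numpy as np

def sorted_qr(H):
    """Return Q, R and a column permutation such that H[:, perm] = Q @ R."""
    Q = np.array(H, dtype=complex)
    n_t = Q.shape[1]
    R = np.zeros((n_t, n_t), dtype=complex)
    perm = np.arange(n_t)
    for i in range(n_t):
        # Greedy step: bring the remaining column with the smallest residual norm to position i.
        norms = np.sum(np.abs(Q[:, i:]) ** 2, axis=0)
        k = i + int(np.argmin(norms))
        Q[:, [i, k]] = Q[:, [k, i]]
        R[:, [i, k]] = R[:, [k, i]]
        perm[[i, k]] = perm[[k, i]]
        # Modified Gram-Schmidt: normalise column i, then remove it from the later columns.
        R[i, i] = np.linalg.norm(Q[:, i])
        Q[:, i] /= R[i, i]
        for j in range(i + 1, n_t):
            R[i, j] = np.vdot(Q[:, i], Q[:, j])
            Q[:, j] -= R[i, j] * Q[:, i]
    return Q, R, perm

# Hypothetical 4 x 4 channel realisation.
rng = np.random.default_rng(3)
H = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
Q, R, perm = sorted_qr(H)
```

Selecting small-norm columns first tends to leave the larger diagonal entries of $\mathbf{R}$ for the last rows, which correspond to the levels near the root of the tree where detection starts.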
In addition to the signal model, the node distribution also affects both the performance and the complexity of the FSD. Although it is difficult to provide a comprehensive analysis of the node distribution that achieves optimal performance, a general method proposed in [27] can be used for an arbitrary constellation and any number of antennas. The number of visited child nodes of each parent node in one level is chosen from the small set $\{1, N_b\}$, where $N_b$ is the number of branches per node. In the top levels, a full search is performed by visiting all $N_b$ branches, whereas in the lower levels only one branch is kept as survivor [28]. When combined with the SQRD, this method guarantees that the nodes with the highest signal-to-noise ratio (SNR) are visited first. In this paper we adopt this kind of node distribution together with the SQRD to maximize the BER performance, with one exception: the case with 64-QAM modulation and list size 128 is implemented with the node distribution {1, 1, 2, 64}, which also follows the basic rule of expanding the paths in the top levels.
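The effect of the node distribution can be illustrated with a small software model of the FSD traversal. This is a sketch under assumptions: the function name `fsd` is hypothetical, the node distribution is indexed by tree level with the last entry belonging to the root, and the rotated observation $\mathbf{Q}^H\mathbf{y}$ is used in place of the exact quantities of (4) to (7).

```python
import numpy as np

def fsd(R, y_hat, constellation, node_dist):
    """Breadth-first FSD: node_dist[i] children are kept per parent at level i (root = last)."""
    n = R.shape[0]
    paths = [(0.0, np.zeros(n, dtype=complex))]
    visited = 0
    for i in range(n - 1, -1, -1):
        new_paths = []
        for d_parent, s in paths:
            b_i = y_hat[i] - R[i, i + 1:] @ s[i + 1:]
            incr = np.abs(b_i - R[i, i] * constellation) ** 2
            for c in np.argsort(incr)[:node_dist[i]]:   # fixed number of closest children
                child = s.copy()
                child[i] = constellation[c]
                new_paths.append((d_parent + incr[c], child))
                visited += 1
        paths = new_paths
    return paths, visited

def square_qam(levels):
    return np.array([a + 1j * b for a in levels for b in levels])

# Hypothetical noise-free 4 x 4 example with 16-QAM and full expansion at the root only.
rng = np.random.default_rng(4)
qam16 = square_qam((-3, -1, 1, 3))
qam64 = square_qam((-7, -5, -3, -1, 1, 3, 5, 7))
H = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
s_tx = rng.choice(qam16, size=4)
Q, R = np.linalg.qr(H)
y_hat = Q.conj().T @ (H @ s_tx)

candidates, visited = fsd(R, y_hat, qam16, node_dist=[1, 1, 1, 16])
best_ped, best_s = min(candidates, key=lambda p: p[0])
list_64qam, _ = fsd(R, y_hat, qam64, node_dist=[1, 1, 2, 64])
```

In this model the number of visited nodes and the list size depend only on the node distribution and never on the channel or the noise, which is the property that gives the FSD its constant throughput.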
Parallel LLR Generator
After the sphere decoding, the candidate list is forwarded to the LLR generator. The LLR of each coded bit is then evaluated based on the candidate list and the a-priori information vector $\mathbf{L}_A$ coming from the outer channel decoder. Let $L_{E,k}$ denote the extrinsic information, i.e., the LLR, of the $k$-th bit in the coded bit vector, $k = R - 1, \ldots, 1, 0$.
The max-log approximation of the $L_{E,k}$ calculation is formulated according to [11]:

$L_{E,k} \approx \frac{1}{2} \max \ldots$ (16)

where $X_{k,+1}$ and $X_{k,-1}$ represent the sets of vectors $\mathbf{x}_i$ having $x_{i,k} = +1$ and $x_{i,k} = -1$, respectively, and $\sigma^2$ is the noise variance.

The whole computation task shows a considerable complexity of $O(PR^2)$ in terms of additions for $P$ candidates and $R$ coded bits. The LLR generation methods reported in [12] and [30] are soft-output only and do not support iterative decoding with the channel decoder, whereas the strategy of mixed STS and LLR generation reported in [6] is soft-input soft-output but cannot be applied to the LSD. In this paper we optimize the algorithm in (16) by reusing intermediate data, which reduces the complexity of LLR generation to $O(PR)$.
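To this end, we define the accumulated a-priori term of the $i$-th candidate as

$T_i = \sum_{k=0}^{R-1} x_{i,k} L_{A,k}$

where $i = P - 1, \ldots, 1, 0$.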
When evaluating LLRs as expressed in (16) 
Therefore, the approach for reducing complexity is to compute $T_i$ for each of the $P$ candidates at the beginning of processing, which requires only $P$ accumulations for all $T_i$. Each $T_i$ is then reused when evaluating the LLRs of the different coded bits: $T_i^k$ is obtained by subtracting $x_{i,k} L_{A,k}$ from $T_i$. Pseudocode of the optimized LLR generation is given in Fig. 5. For simplicity, we denote
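The reuse of $T_i$ can be compared with the direct evaluation in a short sketch. Since (16) is not fully reproduced in this text, the code assumes the widely used max-log form in which each candidate contributes its distance scaled by $-1/\sigma^2$ plus the a-priori term over all bits except bit $k$; the exact scaling factors, the function names and the variable `d` (the final PED of each candidate) are assumptions.

```python
import itertools
import numpy as np

def llr_direct(X, d, L_A, sigma2):
    """Direct evaluation: the a-priori sum is rebuilt for every (candidate, bit) pair."""
    P, R = X.shape
    L_E = np.empty(R)
    for k in range(R):
        metric = np.empty(P)
        for i in range(P):
            prior = sum(X[i, j] * L_A[j] for j in range(R) if j != k)   # R - 1 terms
            metric[i] = -d[i] / sigma2 + prior
        L_E[k] = 0.5 * metric[X[:, k] == +1].max() - 0.5 * metric[X[:, k] == -1].max()
    return L_E

def llr_reuse(X, d, L_A, sigma2):
    """Optimized evaluation: T_i is accumulated once and corrected per bit."""
    T = X @ L_A                                  # T_i, one accumulation per candidate
    base = -d / sigma2 + T
    L_E = np.empty(X.shape[1])
    for k in range(X.shape[1]):
        metric = base - X[:, k] * L_A[k]         # uses T_i^k = T_i - x_{i,k} L_{A,k}
        L_E[k] = 0.5 * metric[X[:, k] == +1].max() - 0.5 * metric[X[:, k] == -1].max()
    return L_E

# Hypothetical list: all 16 bit vectors of length R = 4 with random distances and priors.
rng = np.random.default_rng(5)
X = np.array(list(itertools.product([-1.0, 1.0], repeat=4)))
d = rng.uniform(0.0, 10.0, size=len(X))
L_A = rng.normal(size=4)
llr_a = llr_direct(X, d, L_A, sigma2=0.5)
llr_b = llr_reuse(X, d, L_A, sigma2=0.5)
```

The example list contains every bit pattern, so neither hypothesis set is ever empty; a practical list would need a rule such as clipping for bits on which all candidates agree.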
Based on the optimized LLR generation algorithm, we designed several compact functional blocks in order to build an efficient LLR generator, which is highly parallel and scalable so that the throughput can be improved further.
The architecture of the LLR generator is shown in Fig. 6. The difference between LLR_POS and LLR_MIN is the final output $L_{E,k}$.
Implementation Results and Analysis
We improve the four-nodes-per-cycle FSD architecture described in [18].

We can see that both the FSD and the K-best SD exhibit performance gaps compared with the optimal SESD algorithm. The gaps can be compensated to a certain degree by increasing the number of visited nodes, but the K-best SD with K = 12 needs to visit (16 + 3 × 12 × 16) = 592 nodes, which is many more than the FSD. Therefore we conclude that, among the sub-optimal algorithms, the FSD algorithm is more efficient than the K-best algorithm.
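The node counts behind this comparison can be reproduced with a few lines of arithmetic. The K-best formula follows the expression given above; the FSD count assumes that only the children actually kept by the node distribution are counted as visited, and the distribution {16, 1, 1, 1} (root first) for 16-QAM is an assumption based on the rule of full expansion in the top level. Both function names are hypothetical.

```python
def kbest_visited_nodes(levels, branches, k):
    """Full expansion at the top level, then K survivors each expand all branches."""
    return branches + (levels - 1) * k * branches

def fsd_visited_nodes(node_dist_root_first):
    """Sum the number of surviving paths after each level, starting from the root."""
    total, paths = 0, 1
    for children_per_parent in node_dist_root_first:
        paths *= children_per_parent
        total += paths
    return total

kbest_nodes = kbest_visited_nodes(levels=4, branches=16, k=12)
fsd_nodes_16qam = fsd_visited_nodes([16, 1, 1, 1])
fsd_nodes_64qam_list128 = fsd_visited_nodes([64, 2, 1, 1])
```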
The five cases of FSD implementation are synthesized in the same 0.13 µm CMOS technology. The silicon complexity and its breakdown are shown in Table 1. The silicon area is measured in kilo gate equivalents (KG).

Table 2 shows the available throughput of the five cases. Although the implementations in the real and complex signal models achieve the same processing speed in terms of visited nodes per second (1.60 G in the four-nodes-per-cycle architecture and 3.08 G in the eight-nodes-per-cycle architecture), there are large gaps between their throughputs in terms of coded bits per second, because the total numbers of visited nodes are quite different in the real and complex signal models. Hardware efficiency is given in terms of throughput per area unit. We can see that, for both 16-QAM and 64-QAM modulation, the implementations in the complex signal model achieve higher efficiency at the same list size.
The communications performance and implementation efficiency of the five FSD implementations are shown in Fig. 8. The BER performance is obtained through simulations of a 4 × 4 system coupled with a four-state, rate-1/3 Turbo decoder, which executes 8 decoding iterations [30]. SQRD is applied in all five cases, and two iterations between the soft-output FSD and the Turbo decoder are performed. The proposed LLR generator is also synthesized in the 0.13 µm CMOS technology. It occupies a silicon area of 22 KG at a clock frequency of 500 MHz [31]. When combined with the four-nodes-per-cycle FSD architecture, it yields a highly efficient SISO MIMO detector, which costs only 49.5 KG of silicon area for a 4 × 4 system with 16-QAM modulation.
The implementation efficiency is compared with several recently published SISO MIMO detectors in Table 3. The FSD 2 implemented in the complex signal model achieves the highest efficiency in terms of throughput per area unit. It achieves much higher throughput than the SISO K-best SD reported in [9] and the SISO STS reported in [6], the latter of which suffers from variable throughput. The SISO MMSE detector achieves very high throughput, but at the cost of 410 KG of silicon area, which reduces its efficiency.
Conclusion
In this paper we improve a recently proposed four-nodes-per-cycle FSD architecture into 
