Abstract-In this paper, a recursive Forward-Backward (F-B) trellis algorithm is proposed for soft-output MIMO detection. Instead of using the traditional tree topology, we represent the search space of the MIMO signals with a fully connected trellis and apply a Forward-Backward recursion to compute the a posteriori probability (APP) for each coded data bit. The proposed detector has the following advantages: a) it keeps a fixed throughput and has a regular datapath structure, which makes it amenable to VLSI implementation, and b) it attempts to maximize the a posteriori probability by tracing both forward and backward on the trellis, and it always ensures that at least one candidate exists for every possible transmitted bit x_k ∈ {−1, +1}. Compared with the soft K-best detector, the proposed detector significantly reduces the complexity because sorting is not required, while still maintaining good performance. A maximum throughput of 533 Mbps is achievable at a cost of 576K gates for a 4 × 4 16-QAM system.
I. INTRODUCTION
The depth-first Sphere Decoder (SD) [1] [2] and the breadth-first K-best [3] [4] algorithms have been proposed to achieve maximum a posteriori (MAP) decoding for coded MIMO systems. The depth-first SD algorithm has non-deterministic complexity and variable throughput, which makes it sensitive to the channel conditions. With a small list size, the performance of the depth-first SD degrades because of inaccurate, and in particular infinite, log-likelihood ratios (LLRs). On the other hand, the K-best algorithm has an advantage in hardware implementations since it has fixed complexity, throughput, and latency. However, when K is large, the complexity of the K-best algorithm increases dramatically because a large number of paths have to be extended and sorted.
In this paper, a new efficient Forward-Backward (F-B) trellis search algorithm and its VLSI architecture are introduced for high-throughput soft-output MIMO detection. The detector is based on a suboptimal double-direction trellis traversal algorithm. This algorithm always ensures that a full Euclidean distance is found for every possible transmitted bit; therefore, it avoids the LLR clipping issues that both depth-first SD and K-best detectors have. The soft K-best algorithm usually does not perform backward tree traversal, which limits its performance because of inaccurate LLR generation. In our approach, we travel both forward and backward in the trellis to generate a more accurate LLR for each coded bit. This low-latency detector offers a good solution for high-throughput MIMO detection.
II. SYSTEM MODEL
We consider a coded MIMO system with M transmit antennas and N receive antennas. The MIMO transmission can be modeled as:
$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n} \qquad (1)$$
where H is the N × M channel matrix, s = [s_0, s_1, ..., s_{M−1}]^T is the M × 1 transmitted vector, y is an N × 1 received vector, and n is a complex Gaussian noise vector with variance σ^2. Each symbol of s is obtained using the mapping function s_m = map(x_m), m = 0, ..., M − 1, where x_m is an M_c × 1 vector of data bits and M_c is the number of bits per constellation symbol.
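The model in (1) and the bit-to-symbol mapping can be illustrated with a few lines of Python. This is only a sketch: the QPSK labeling in `map_qpsk`, the helper names, and the example channel values are assumptions, since the text does not specify the constellation labeling.

```python
import math

def map_qpsk(bits):
    """Map a pair of bits in {-1, +1} to one unit-energy QPSK symbol.

    Assumed labeling (not taken from the paper): the first bit sets the
    real part and the second bit sets the imaginary part.
    """
    b0, b1 = bits
    return complex(b0, b1) / math.sqrt(2)

def transmit(H, s, n):
    """Return y = H s + n, with the N x M matrix H given as a list of rows."""
    return [sum(h * sym for h, sym in zip(row, s)) + noise
            for row, noise in zip(H, n)]

# Two transmit and two receive antennas, noise-free for clarity.
x = [(+1, -1), (-1, -1)]               # M bit vectors of M_c = 2 bits each
s = [map_qpsk(bits) for bits in x]     # transmitted symbol vector
H = [[1.0, 0.5j], [0.2, 1.0]]          # illustrative channel matrix
y = transmit(H, s, n=[0.0, 0.0])       # received vector
```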
The soft-output detector computes the APP L-value of each bit x_k
where L_A and L_E denote the a priori L-value and the extrinsic L-value, respectively. Using the Max-Log approximation, L_D(x_k|y) can be simplified to [4] [5]
where X_{k,±1} = {x | x_k = ±1}, and
Using the QR decomposition H = QR, where Q and R are an N × M unitary matrix and an M × M upper triangular matrix, respectively, we can write (4) as
where ŷ = Q^H y, and C is a constant that does not affect the minimizations in (3).
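The effect of the QR step can be checked numerically. The sketch below uses a hand-written Gram-Schmidt QR (an illustrative choice, not the decomposition method used by the authors) and confirms that, for a square full-rank H, the distance ||y − Hs||^2 equals ||ŷ − Rs||^2, so the constant C is zero in that case. All function names and numeric values are assumptions made for the example.

```python
def qr_gram_schmidt(H):
    """QR of an N x M full-column-rank matrix (list of rows, complex entries).

    Returns Q (N x M, orthonormal columns) and R (M x M, upper triangular).
    """
    N, M = len(H), len(H[0])
    cols = [[H[r][c] for r in range(N)] for c in range(M)]
    q_cols = []
    R = [[0j] * M for _ in range(M)]
    for j in range(M):
        v = [complex(e) for e in cols[j]]
        for i in range(j):
            # Projection coefficient q_i^H v, then remove that component.
            R[i][j] = sum(qi.conjugate() * vi for qi, vi in zip(q_cols[i], v))
            v = [vi - R[i][j] * qi for vi, qi in zip(v, q_cols[i])]
        norm = sum(abs(vi) ** 2 for vi in v) ** 0.5
        R[j][j] = complex(norm)
        q_cols.append([vi / norm for vi in v])
    Q = [[q_cols[c][r] for c in range(M)] for r in range(N)]
    return Q, R

def sq_dist(A, y, s):
    """Return the squared Euclidean distance ||y - A s||^2."""
    return sum(abs(yi - sum(a * sj for a, sj in zip(row, s))) ** 2
               for row, yi in zip(A, y))

H = [[1 + 1j, 2.0], [0.5j, 1 - 1j]]
y = [0.3 - 0.2j, 1.1 + 0.4j]
s = [1 + 1j, -1 - 1j]
Q, R = qr_gram_schmidt(H)
# y_hat = Q^H y
y_hat = [sum(Q[r][c].conjugate() * y[r] for r in range(len(y)))
         for c in range(len(R))]
d_direct = sq_dist(H, y, s)   # distance computed with H
d_qr = sq_dist(R, y_hat, s)   # same distance computed with R (N = M here)
```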
Solving (3) requires exhaustive search for each bit x k . In order to reduce the complexity, conventional SD or K-best detectors can be used to generate a list of L candidates which have the smallest Euclidean distance to approximate (3). 
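To make the cost of the exhaustive search concrete, the following sketch enumerates all Q^M hypotheses and forms a Max-Log L-value for each bit from the two minimum distances. It assumes no a priori information, a 1/σ^2 scaling, and a bit labeling derived from the binary index of each constellation point; these conventions and the function name are assumptions for illustration, not taken from (3).

```python
from itertools import product

def exhaustive_maxlog_llr(R, y_hat, constellation, sigma2):
    """Brute-force Max-Log L-values with no a priori information.

    Bit j of antenna m is read from the state index q as (q >> j) & 1,
    with 1 meaning x = +1 and 0 meaning x = -1 (assumed labeling).
    Returns llr[m][j] = (min d over x = -1  -  min d over x = +1) / sigma2.
    """
    M, Q = len(R), len(constellation)
    n_bits = Q.bit_length() - 1          # Q is assumed to be a power of two
    inf = float("inf")
    # best[m][j][b]: smallest distance seen with bit (m, j) equal to b
    best = [[[inf, inf] for _ in range(n_bits)] for _ in range(M)]
    for states in product(range(Q), repeat=M):      # all Q**M hypotheses
        s = [constellation[q] for q in states]
        d = sum(abs(y_hat[i] - sum(R[i][j] * s[j] for j in range(i, M))) ** 2
                for i in range(M))
        for m, q in enumerate(states):
            for j in range(n_bits):
                b = (q >> j) & 1
                best[m][j][b] = min(best[m][j][b], d)
    return [[(best[m][j][0] - best[m][j][1]) / sigma2 for j in range(n_bits)]
            for m in range(M)]

# 2 x 2 BPSK example, noise-free, transmitted s = [+1, -1].
R = [[2.0, 0.5], [0.0, 1.5]]
y_hat = [1.5, -1.5]
llr = exhaustive_maxlog_llr(R, y_hat, [-1.0, 1.0], sigma2=1.0)
# llr == [[16.0], [-10.0]]: the signs match the transmitted bits.
```

The loop visits Q^M hypotheses, which is why list-based or trellis-based approximations are used in practice.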
III. PROPOSED F-B MIMO DETECTION ALGORITHM
We represent the search tree with its equivalent trellis diagram, as shown in Fig. 1, and apply the recursive Forward-Backward algorithm to solve (3). Fig. 1(b) shows a fully connected trellis with Q = 2^{M_c} states, numbered from 0 to Q − 1. There are M steps in the trellis diagram, numbered from 0 to M − 1; note that k = −1 corresponds to the tree root node in Fig. 1(a). The trellis is fully connected in that each state/node has Q input paths and Q output paths. A path in the trellis can be represented by a state sequence {q_0, q_1, ..., q_{M−1}}, which indicates the trellis path starting at state q_0, passing through state q_k at step k, and terminating at state q_{M−1}.
To describe the baseline F-B algorithm, we assume for now that there is no a priori L-value L_A for the detector. Hence, solving (3) is equivalent to finding the minimum Euclidean distance
for each coded bit x k . The F-B algorithm is described as follows:
A) Forward Recursion:
Let α_k(q) be the state metric, which represents the partial Euclidean distance, for state q (q = 0, 1, ..., Q − 1) at step k (k = 0, 1, ..., M − 1). Let γ(q', q) denote the branch metric from state q' to state q. Let the history of the best forward path for state q at step k be stored in an array φ_k^q.
where the branch metric
The complex transmitted symbol s_j in (9) is formed by using the constellation mapping function:
After the minimum α_k(q) for each state q is found, the forward history path array φ_k^q is updated with the best incoming state
$$\tilde{q} = \arg\min_{q'} \left( \alpha_{k-1}(q') + \gamma(q', q) \right).$$
A forward recursion example for a 4 × 4 QPSK system is illustrated in Fig. 2, where the forward history path array φ_k^q is shown for each state node. For simplicity, only the survivor path for each state is shown. This algorithm can be used to find the ML path for the trellis, which is highlighted with a bold line. However, our goal is to find the minimum Euclidean distance Ω for every coded bit. Except for the first antenna (k = 3), not every state node has a fully extended path in Fig. 2. This is because of the greedy path selection algorithm: only the best path is retained for each node. In order to find a minimum full Euclidean distance for every state node in each antenna, a backward recursion is performed after the forward recursion.
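The forward recursion can be sketched as follows. This is a minimal illustration under stated assumptions: step k is taken to detect antenna M − 1 − k, the branch metric is the squared residual of row M − 1 − k of R using the survivor history of the previous state, and the names `forward_recursion`, `alpha`, and `phi` are chosen for the example rather than taken from the authors' implementation.

```python
def forward_recursion(R, y_hat, constellation):
    """Greedy forward pass over a fully connected trellis.

    Returns alpha[k][q] (state metrics) and phi[k][q] (survivor state
    sequences for steps 0..k). Step k is assumed to detect antenna M-1-k.
    """
    M, Q = len(R), len(constellation)
    i = M - 1
    alpha = [[abs(y_hat[i] - R[i][i] * constellation[q]) ** 2
              for q in range(Q)]]
    phi = [[[q] for q in range(Q)]]
    for k in range(1, M):
        i = M - 1 - k                       # row of R used at this step
        alpha_k, phi_k = [], []
        for q in range(Q):
            best_metric, best_path = float("inf"), None
            for q_prev in range(Q):         # Q incoming branches per state
                path = phi[k - 1][q_prev] + [q]
                residual = y_hat[i]
                for t, state in enumerate(path):
                    residual -= R[i][M - 1 - t] * constellation[state]
                metric = alpha[k - 1][q_prev] + abs(residual) ** 2
                if metric < best_metric:    # keep only the best survivor
                    best_metric, best_path = metric, path
            alpha_k.append(best_metric)
            phi_k.append(best_path)
        alpha.append(alpha_k)
        phi.append(phi_k)
    return alpha, phi

# 2 x 2 BPSK example, noise-free, transmitted s = [+1, -1].
R = [[2.0, 0.5], [0.0, 1.5]]
y_hat = [1.5, -1.5]
alpha, phi = forward_recursion(R, y_hat, [-1.0, 1.0])
# alpha[1] == [16.0, 0.0]; phi[1][1] == [0, 1], i.e. antenna 1 sends -1
# and antenna 0 sends +1, which is the transmitted vector.
```

Only minimum searches over Q candidates appear in the inner loop; no sorted list of paths is maintained.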
B) Backward Recursion:
Similarly, let β_k(q) be the backward state metric. The backward recursion, which searches from antenna 0 to antenna M − 1, is summarized as follows:
where the backward branch metric
The symbol s_j in (14) is formed by using the forward path array φ_k^q and the incoming state q'.
Fig. 3 shows the trellis diagram after the backward recursion. The dotted and solid lines denote the survivor paths for the backward recursion and the forward recursion, respectively. The backward β metric is an approximation of the partial Euclidean distance (PED). Combining the forward α metric and the backward β metric gives a good approximation of the minimum Euclidean distance for each trellis node. If we examine the trellis diagram after the backward recursion, every state node now has a fully extended minimum path metric Ω that can be used to generate the APP L-value. For a pipelined implementation, the APP L-value can be generated in parallel with the backward recursion, as shown in Fig. 6.
B.2) LLR Generation
where Ω_k(q) is an approximation of the minimum Euclidean distance obtained by combining the forward and backward metrics for state q at step k, and
is the APP L-value for the j-th bit of the transmitted data vector x_{M−1−k}, and q^{<j>} = ±1 denotes the subset of states {q} (q = 0, 1, ..., Q − 1) whose j-th bit equals ±1 (0 ≤ j ≤ log_2(Q) − 1).
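Given the per-state metrics Ω_k(q) at one trellis step, the L-value of each bit follows from two minimum searches over the state subsets. The sketch below assumes that bit j of a state is bit j of its binary index, that a 1 corresponds to x = +1, and that the result is scaled by 1/σ^2; these conventions and the function name are assumptions for illustration.

```python
def llr_from_omega(omega_k, sigma2=1.0):
    """Max-Log L-values for one trellis step from the metrics Omega_k(q).

    omega_k[q] is the (approximate) minimum full Euclidean distance of the
    best path through state q. Returns one L-value per bit position j.
    """
    Q = len(omega_k)
    n_bits = Q.bit_length() - 1            # Q is assumed to be a power of two
    llrs = []
    for j in range(n_bits):
        min_plus = min(omega_k[q] for q in range(Q) if (q >> j) & 1 == 1)
        min_minus = min(omega_k[q] for q in range(Q) if (q >> j) & 1 == 0)
        llrs.append((min_minus - min_plus) / sigma2)
    return llrs

# QPSK step with four state metrics.
print(llr_from_omega([0.0, 1.0, 4.0, 9.0]))   # [-1.0, -4.0]
```

Because every state carries a finite Ω_k(q), both minima always exist and no L-value is infinite.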
IV. SIMULATION RESULTS
To evaluate the detection performance, we consider 4 × 4 16-QAM and 64-QAM MIMO systems (the channel matrices are assumed to have independent Rayleigh fading distribution). In the simulation, the soft output of the detector is fed to a length-2304, rate-1/2 LDPC decoder [6], which performs up to 15 iterations. Fig. 4 and Fig. 5 show the bit error rate performance of the proposed F-B detector and the soft K-best complex detector with different K values. For the 4 × 4 16-QAM system, our F-B detector outperforms the K-best detector for K = 16 and 32, and achieves performance similar to that for K = 64. The same trend has been observed for the 4 × 4 64-QAM system, where our F-B detector outperforms the soft K-best detector with K = 32, 48 and 64.
V. ARCHITECTURE DESIGN
Fig. 6 shows the proposed architecture based on the F-B algorithm. The proposed architecture is well suited to VLSI implementation because it has a regular data flow, fixed complexity, and fixed throughput and latency. Compared with the soft K-best detector, our architecture has lower complexity since no sorting is required; only finding the minimum value is required. Therefore, the critical path of the F-B detector would be shorter than that of the K-best detector. The latency is also reduced.
A tile chart is used to represent the detector data flow, which is shown in Fig. 7. The X-axis represents the MIMO symbol sequences and the Y-axis represents the decoding time. In Fig. 7, the forward recursion (F) is followed by the backward recursion (B), and the LLRs are generated in parallel with the backward recursion.
A. Partial Euclidean Distance Computation Unit
The main computation units of this detector are the α and β units. Since the α and β units have very similar structures, we refer to both as the node processing unit (NPU) hereinafter. The NPU is responsible for calculating the partial Euclidean distances (PEDs) and finding the minimum PED among all the candidates. Because of the upper triangular property of the R matrix, the PED can be computed recursively as shown below:
where d_M is initialized to 0, and d_0 is the full Euclidean distance.
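A common form of this recursion, assumed here because it follows from the upper triangular structure of R, is d_i = d_{i+1} + |ŷ_i − Σ_{j=i}^{M−1} R_{i,j} s_j|^2. The sketch below evaluates it from the last row upward; the function name and the example values are hypothetical.

```python
def ped_recursive(R, y_hat, s):
    """Partial Euclidean distances d[0..M] for one candidate vector s.

    d[M] = 0 and d[0] is the full distance ||y_hat - R s||^2.
    """
    M = len(R)
    d = [0.0] * (M + 1)
    for i in range(M - 1, -1, -1):
        # Row i of R only involves symbols s[i], ..., s[M-1].
        interference = sum(R[i][j] * s[j] for j in range(i, M))
        d[i] = d[i + 1] + abs(y_hat[i] - interference) ** 2
    return d

print(ped_recursive([[2.0, 0.5], [0.0, 1.5]], [1.5, -1.5], [1.0, 1.0]))
# [10.0, 9.0, 0.0]
```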
Since the trellis structure is fully connected, each state node needs to compute Q PEDs. Fig. 8 shows the architecture of the partial Euclidean distance computation unit (PEU). For simplicity, we assume a QPSK modulation scheme with M transmit and M receive antennas. In Fig. 8, SADD stands for shift and add, which implements R_{i,j} s_j, where C_x (x = 0, 1, ..., Q − 1) are the constant constellation points. For the QPSK scheme, Q = 4 new partial Euclidean distances (NPEDs) are computed and sent to the compare and select unit for further processing.
B. Compare and Select Unit
The compare and select unit (CSU), which is shown in Fig. 9, is used to find the minimum PED among the Q input NPEDs.
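The behavior of such a unit can be modeled in software as a pairwise comparator tree that returns the minimum value together with the index of the winning input. The tree organization is an assumption made for illustration; the actual structure is the one drawn in Fig. 9, and the function name is hypothetical.

```python
def compare_select(npeds):
    """Return (min_value, index) of a non-empty list via a comparator tree."""
    candidates = list(enumerate(npeds))          # (index, value) pairs
    while len(candidates) > 1:
        next_level = []
        for a in range(0, len(candidates) - 1, 2):
            left, right = candidates[a], candidates[a + 1]
            next_level.append(left if left[1] <= right[1] else right)
        if len(candidates) % 2:                  # odd element passes through
            next_level.append(candidates[-1])
        candidates = next_level
    index, value = candidates[0]
    return value, index

print(compare_select([3.0, 1.5, 2.0, 4.0]))      # (1.5, 1)
```

For Q inputs the tree needs Q − 1 comparisons arranged in about log_2(Q) levels, in contrast to the sorting network a K-best detector requires.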
C. Node Processing Unit
In each step of the trellis algorithm, the Q state nodes work independently and can therefore be processed in parallel. By instantiating Q = 4 PEUs and CSUs, the top-level node processing unit (NPU) for QPSK systems is obtained, as shown in Fig. 10. This is an iterative hardware architecture which implements (16). The latency of one iteration is 3 cycles.
D. Architecture for Higher order Modulation Systems
Having shown the hardware architecture for QPSK systems, we now extend it to higher-order modulation schemes such as 16-QAM. The PEU-E and CSU-E in Fig. 11 are extensions of the PEU and CSU, obtained by replicating the hardware four times to support a 16-QAM system. The PEU-E unit is used for computing 16 branch metrics, and the CSU-E unit is used for selecting the minimum PED from 16 candidates.
Based on the PEU-E and CSU-E units, the node processing unit (NPU) for a 16-QAM system has an architecture very similar to that of the QPSK system. As shown in Fig. 12, 16 PEU-Es and CSU-Es are instantiated so that 16 nodes can be processed in parallel. The latency of each iteration remains 3 cycles. Therefore, the throughput of a 16-QAM system is twice that of the QPSK system.
E. Hardware Complexity and Throughput Analysis
Table I shows the hardware complexity, detection throughput, and latency analysis for 4 × 4 QPSK and 16-QAM systems. The gate count estimation is based on a TSMC 65 nm standard cell CMOS library. The highest clock frequency that the detector can achieve is about 400 MHz. The decoding latency for a 4 × 4 system is 3 × M = 12 cycles. Table II compares the detection throughput and hardware complexity of the proposed F-B solution with two hardware implementations from the literature: the depth-first soft sphere detector with 256 search operations (fclk = 122.88 MHz) from [1], and the soft K-best detector (fclk = 200 MHz) from [4]. In [4], a real QR decomposition is used with a small K = 5. Based on the simulation results in Fig. 4, our solution has better BER performance than [4] and can achieve a higher throughput because we limit the number of sorting operations, which are very expensive in hardware. On the other hand, at the cost of more hardware resources, the depth-first detector in [1] has better BER performance than our solution. However, [1] has limited throughput because of the large number of sequential search operations, and its most undesirable feature is its variable throughput at different SNR levels. Our architecture provides a good solution in between the depth-first detector and the K-best detector.
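The throughput figures can be cross-checked with a short calculation. The sketch assumes that one symbol vector carrying M × M_c bits is completed every 3 × M clock cycles at 400 MHz; this relation is inferred from the stated latency and clock frequency rather than given explicitly, and the function name is hypothetical.

```python
def throughput_mbps(num_tx, bits_per_symbol, f_clk_mhz, cycles_per_step=3):
    """Detection throughput in Mbps under the one-vector-per-3M-cycles assumption."""
    latency_cycles = cycles_per_step * num_tx        # 3 x M cycles per vector
    bits_per_vector = num_tx * bits_per_symbol       # M x M_c coded bits
    return bits_per_vector * f_clk_mhz / latency_cycles

qpsk = throughput_mbps(4, 2, 400.0)     # about 266.7 Mbps
qam16 = throughput_mbps(4, 4, 400.0)    # about 533.3 Mbps
```

Under this assumption the 16-QAM value matches the 533 Mbps quoted in the abstract and is twice the QPSK value.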

VI. CONCLUSION
We propose a new MIMO detector architecture based on the Forward-Backward recursion algorithm. This scheme can achieve very high throughput and can be easily parallelized. Both throughput and latency are deterministic; hence, it is very suitable for hardware implementation.
VII. ACKNOWLEDGEMENT
This work was supported in part by Nokia and by NSF under grants CCF-0541363, CNS-0551692, and CNS-0619767.
