A New MIMO Detector Architecture Based on A
Forward-Backward Trellis Algorithm
Yang Sun and Joseph R. Cavallaro
Department of Electrical and Computer Engineering
Rice University, Houston, TX 77005
Email: {ysun, cavallar}@rice.edu
Abstract—In this paper, a recursive Forward-Backward (F-B)
trellis algorithm is proposed for soft-output MIMO detection.
Instead of using the traditional tree topology, we represent the
search space of the MIMO signals with a fully connected trellis
and a Forward-Backward recursion is applied to compute the
a posteriori probability (APP) for each coded data bit. The
proposed detector has the following advantages: a) it keeps a
fixed throughput and has a regular datapath structure, which
makes it amenable to VLSI implementation, and b) it attempts
to maximize the a posteriori probability by tracing both forward
and backward on the trellis, which ensures that at
least one candidate exists for every possible transmitted bit
$x_k \in \{-1,+1\}$. Compared with the soft K-best detector, the
proposed detector significantly reduces the complexity because
sorting is not required, while still maintaining good performance.
A maximum throughput of 533 Mbps is achievable at a cost of
576K gates for a 4×4 16-QAM system.
I. INTRODUCTION
The depth-first Sphere Decoder (SD) [1][2] and the breadth-first
K-best [3][4] algorithms have been proposed by researchers
to achieve maximum a posteriori (MAP) decoding
for coded MIMO systems. The depth-first SD algorithm has
non-deterministic complexity and variable throughput, which
make it sensitive to the channel conditions. With a small list
size, the performance of the depth-first SD degrades
because of inaccurate, and in particular infinite, log-likelihood
ratios (LLRs). On the other hand, the K-best algorithm has
an advantage in hardware implementations since it has fixed
complexity, throughput and latency. However, when K is large,
the complexity of the K-best algorithm increases dramatically
because a large number of paths have to be extended and
sorted.
In this paper, a new efficient Forward-Backward (F-B) trellis
search algorithm and its VLSI architecture are introduced for
high-throughput soft-output MIMO detection. It is based on
a suboptimal double-direction trellis traversal algorithm. This
algorithm always ensures that a full Euclidean distance is
found for every possible transmitted bit, and therefore avoids
the LLR clipping issues that both depth-first SD and K-best
detectors have. The soft K-best algorithm usually does not
perform backward tree traversal, which limits its performance
because of inaccurate LLR generation. In our approach, we
travel both forward and backward in
the trellis to generate a more accurate LLR for each coded
bit. This low-latency detector offers a good solution for
high-throughput MIMO detection.
II. SYSTEM MODEL
We consider a coded MIMO system with $M$ transmit
antennas and $N$ receive antennas. The MIMO transmission
can be modeled as
\[ \mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}, \tag{1} \]
where $\mathbf{H}$ is an $N \times M$ complex matrix, $\mathbf{s} = [s_0, \ldots, s_{M-1}]^T$
is the transmitted vector, $\mathbf{y}$ is an $N \times 1$ received vector and $\mathbf{n}$
is a complex Gaussian noise vector with variance $\sigma^2$. Each
symbol of $\mathbf{s}$ is obtained using the mapping function
$s_m = \mathrm{map}(\mathbf{x}_m)$, $m = 0, \ldots, M-1$, where $\mathbf{x}_m$ is an $M_c \times 1$ vector of data
bits, and $M_c$ is the number of bits per constellation symbol.
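The soft-output detector computes the APP L-value of
the bit $x_k$, $k = 0, \ldots, M \cdot M_c - 1$, as
\[ L_D(x_k|\mathbf{y}) = \ln \frac{P[x_k = +1|\mathbf{y}]}{P[x_k = -1|\mathbf{y}]} = L_A(x_k) + L_E(x_k), \tag{2} \]
where $L_A$ and $L_E$ denote the a priori L-value and the extrinsic
L-value, respectively. Using the Max-Log approximation, $L_D(x_k|\mathbf{y})$
can be simplified to [4][5]
\[ L_D(x_k|\mathbf{y}) \approx \min_{\mathbf{x} \in \mathcal{X}_{k,-1}} \Lambda(\mathbf{s},\mathbf{y},\mathbf{L}_A) - \min_{\mathbf{x} \in \mathcal{X}_{k,+1}} \Lambda(\mathbf{s},\mathbf{y},\mathbf{L}_A), \tag{3} \]
where $\mathcal{X}_{k,\pm 1} = \{\mathbf{x} \,|\, x_k = \pm 1\}$, and
\[ \Lambda(\mathbf{s},\mathbf{y},\mathbf{L}_A) = \frac{1}{\sigma^2} \|\mathbf{y} - \mathbf{H} \cdot \mathbf{s}\|^2 - \frac{1}{2} \mathbf{x}^T \cdot \mathbf{L}_A. \tag{4} \]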
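Using the QR decomposition $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$
and $\mathbf{R}$ refer to an $N \times M$ unitary matrix and an $M \times M$ upper
triangular matrix, respectively, we can write (4) as
\[ \Lambda(\mathbf{s},\mathbf{y},\mathbf{L}_A) = \frac{1}{\sigma^2} \|\hat{\mathbf{y}} - \mathbf{R} \cdot \mathbf{s}\|^2 - \frac{1}{2} \mathbf{x}^T \cdot \mathbf{L}_A + C, \tag{5} \]
where $\hat{\mathbf{y}} = \mathbf{Q}^H \mathbf{y}$, and $C$ is a constant that does not affect the
minimizations in (3).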
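As a side note on the constant in (5), the short derivation below sketches where $C$ comes from. It assumes only that $\mathbf{Q}$ has orthonormal columns ($\mathbf{Q}^H\mathbf{Q} = \mathbf{I}$); the projector $\mathbf{I} - \mathbf{Q}\mathbf{Q}^H$ is notation introduced here for illustration and does not appear in the original text.

```latex
\begin{aligned}
\mathbf{y} - \mathbf{H}\mathbf{s}
  &= \mathbf{Q}\left(\mathbf{Q}^H\mathbf{y} - \mathbf{R}\mathbf{s}\right)
   + \left(\mathbf{I} - \mathbf{Q}\mathbf{Q}^H\right)\mathbf{y}, \\
\|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2
  &= \|\hat{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2
   + \|(\mathbf{I} - \mathbf{Q}\mathbf{Q}^H)\mathbf{y}\|^2 .
\end{aligned}
```

The two terms in the first line are orthogonal, so the cross term vanishes in the second line. The last norm does not depend on $\mathbf{s}$, which gives $C = \frac{1}{\sigma^2}\|(\mathbf{I} - \mathbf{Q}\mathbf{Q}^H)\mathbf{y}\|^2$; for a square system ($N = M$) this term is zero.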
Solving (3) requires an exhaustive search for each bit $x_k$. In
order to reduce the complexity, conventional SD or K-best
detectors can be used to generate a list of $L$ candidates which
have the smallest Euclidean distance to approximate (3).
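To make the cost of the exhaustive search concrete, here is a minimal Python sketch of (3) with no a priori information ($\mathbf{L}_A = 0$). The function names and the QPSK bit-to-symbol mapping are assumptions made for illustration; the paper does not specify a mapping.

```python
from itertools import product

def map_symbol(bits):
    """Hypothetical QPSK mapping: two bits in {-1,+1} -> one complex point."""
    return complex(bits[0], bits[1])

def maxlog_llr_exhaustive(H, y, sigma2, mc=2):
    """Evaluate (3) with L_A = 0 by enumerating every bit vector x."""
    n_rx, n_tx = len(H), len(H[0])
    n_bits = n_tx * mc
    # best[(k, b)] holds the smallest metric seen with bit k equal to b
    best = {(k, b): float("inf") for k in range(n_bits) for b in (-1, +1)}
    for x in product((-1, +1), repeat=n_bits):
        s = [map_symbol(x[m * mc:(m + 1) * mc]) for m in range(n_tx)]
        # Lambda in (4) without the a priori term
        dist = sum(abs(y[i] - sum(H[i][m] * s[m] for m in range(n_tx))) ** 2
                   for i in range(n_rx)) / sigma2
        for k, b in enumerate(x):
            if dist < best[(k, b)]:
                best[(k, b)] = dist
    return [best[(k, -1)] - best[(k, +1)] for k in range(n_bits)]
```

The loop visits $2^{M \cdot M_c}$ bit vectors, which is 65536 candidates per received vector for a 4×4 16-QAM system; this is the cost that list-based detectors try to avoid.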
978-1-4244-2941-7/08/$25.00 ©2008 IEEE. Asilomar 2008
Authorized licensed use limited to: Rice University. Downloaded on June 16, 2009 at 12:27 from IEEE Xplore.  Restrictions apply.
[Figure: (a) tree topology with a root node at $k_{-1}$ and Q branches per tree node; (b) fully connected trellis diagram with Q trellis nodes/states, numbered 0 to Q−1, at each step $k_0, \ldots, k_{M-1}$]
Fig. 1. Tree topology and its equivalent trellis diagram.
III. PROPOSED F-B MIMO DETECTION ALGORITHM
We represent the search tree with its equivalent trellis diagram,
as shown in Fig. 1, and apply the recursive Forward-Backward
algorithm to solve (3). Fig. 1(b) shows a fully connected
trellis with $Q = 2^{M_c}$ states, numbered from 0 to $Q-1$.
There are $M$ steps in the trellis diagram, numbered from 0 to
$M-1$; note that $k_{-1}$ corresponds to the tree root node in
Fig. 1(a). The trellis is fully connected in that each state/node
has $Q$ input paths and $Q$ output paths. A path in the trellis can
be represented by a state sequence $\{q_0, q_1, \ldots, q_{M-1}\}$, which
indicates the trellis path starting at state $q_0$, passing through
every state $q_k$ at step $k$, and terminating at state $q_{M-1}$.
To describe the baseline F-B algorithm, we assume for now that
there is no a priori L-value $\mathbf{L}_A$ for the detector. Hence, solving
(3) is equivalent to finding the minimum Euclidean distance
\[ \Omega = \|\hat{\mathbf{y}} - \mathbf{R} \cdot \mathbf{s}\|^2 \tag{6} \]
for each coded bit $x_k$. The F-B algorithm is described as
follows:
A) Forward Recursion:
Let $\alpha_k(q)$ be the state metric, which represents the partial
Euclidean distance, for state $q$ ($q = 0, 1, \ldots, Q-1$) at step $k$
($k = 0, 1, \ldots, M-1$). Let $\gamma_k(q', q)$ denote the branch metric
from state $q'$ to $q$. Let the history of the best forward path
for state $q$ at step $k$ be stored in an array $\phi_k^q(j)$, where $j$
is the index of the array, $0 \le j \le M-1$. The forward
recursion, which searches from antenna $M-1$ to antenna
0, is summarized as follows:
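1. Input $\mathbf{R}$, $\hat{\mathbf{y}}$ and initialization $k = -1$:
\[ \alpha_k(i) = \begin{cases} 0, & i = 0 \\ \infty, & 1 \le i \le Q-1 \end{cases} \qquad \phi_k^i(0) = \emptyset, \quad 0 \le i \le Q-1 \tag{7} \]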
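2. Forward recursion $k = 0, 1, \ldots, M-1$:
For each state $q$ ($0 \le q \le Q-1$)
\[ \alpha_k(q) = \min_{0 \le q' \le Q-1} \left\{ \alpha_{k-1}(q') + \gamma_k(q', q) \right\}, \tag{8} \]
where the branch metric is
\[ \gamma_k(q', q) = \left| \hat{y}_{M-1-k} - \sum_{j=M-1-k}^{M-1} R_{M-1-k,j} \cdot s_j \right|^2. \tag{9} \]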
The complex transmitted symbol $s_j$ in (9) is formed by using
the constellation mapping function:
\[ s_j = \begin{cases} \mathrm{map}(q), & j = M-1-k \\ \mathrm{map}\left(\phi_{k-1}^{q'}(M-1-j)\right), & j > M-1-k \end{cases} \tag{10} \]
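After the minimum $\alpha_k(q)$ for each state $q$ is found, the forward
history path array $\phi_k^q$ is updated by
\[ \tilde{q} = \arg\min_{0 \le q' \le Q-1} \left\{ \alpha_{k-1}(q') + \gamma_k(q', q) \right\}, \qquad \phi_k^q(i) = \begin{cases} q, & i = k \\ \phi_{k-1}^{\tilde{q}}(i), & 0 \le i \le k-1 \end{cases} \tag{11} \]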
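[Figure: 4-state trellis over steps k=0 to k=3, with the forward history path array shown at each state node:
State   k=0   k=1     k=2       k=3
0       [0]   [1,0]   [0,2,0]   [0,2,1,0]
1       [1]   [3,1]   [0,2,1]   [0,2,1,1]
2       [2]   [0,2]   [2,3,2]   [2,3,3,2]
3       [3]   [2,3]   [2,3,3]   [2,3,3,3]
]
Fig. 2. Example of a 4-state trellis after forward recursion.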
A forward recursion example for a 4×4 QPSK system
is illustrated in Fig. 2, where the forward history path
array $\phi_k^q$ is shown for each state node. For simplicity, only
the survivor path for each state is shown. This algorithm
can be used to find the ML path for the trellis, which is
highlighted with a bold line. However, our goal is to find
the minimum Euclidean distance $\Omega$ for every coded bit.
Except for the first antenna ($k=3$), not every state node
has a fully extended path in Fig. 2. This is because of the
greedy path selection: only the best path is
retained for each node. In order to find a minimum full
Euclidean distance for every state node at each antenna,
a backward recursion is performed after the forward recursion.
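The forward recursion in (7) to (11) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the list-of-lists layout for $\mathbf{R}$, the `QPSK` constellation table (standing in for map(q)) and the function name are assumptions, and the root initialization in (7) is modeled as a single predecessor with metric 0 and an empty path.

```python
QPSK = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]   # hypothetical map(q), q = 0..3

def forward_recursion(R, yhat, const):
    """Forward pass of (7)-(11): returns alpha[k][q] and phi[k][q]."""
    M, Q = len(yhat), len(const)
    alpha, phi = [], []          # phi[k][q] = [q_0, ..., q_k], with q_k == q
    for k in range(M):
        row = M - 1 - k          # antenna (row of R) handled at step k
        # Step 0 has one predecessor: the root, with metric 0 and no history.
        preds = [(0.0, [])] if k == 0 else list(zip(alpha[k - 1], phi[k - 1]))
        alpha_k, phi_k = [], []
        for q in range(Q):
            best_metric, best_path = float("inf"), None
            for prev_metric, prev_path in preds:
                # Symbol of antenna j > row was fixed at step M-1-j, as in (10).
                acc = R[row][row] * const[q]
                for j in range(row + 1, M):
                    acc += R[row][j] * const[prev_path[M - 1 - j]]
                metric = prev_metric + abs(yhat[row] - acc) ** 2   # (8), (9)
                if metric < best_metric:
                    best_metric, best_path = metric, prev_path + [q]
            alpha_k.append(best_metric)
            phi_k.append(best_path)                                # (11)
        alpha.append(alpha_k)
        phi.append(phi_k)
    return alpha, phi
```

Each step evaluates $Q^2$ branch metrics and keeps one survivor per state, so the work per step is fixed and no sorting is involved.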
B) Backward Recursion:
Similarly, let $\beta_k(q)$ be the backward state metric. The
backward recursion, which searches from antenna 0 to antenna
$M-1$, is summarized as follows:
1. Input $\mathbf{R}$, $\hat{\mathbf{y}}$, array $\phi_k^q$ and initialization $k = M-1$:
\[ \beta_k(i) = 0, \quad 0 \le i \le Q-1 \tag{12} \]
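2. Backward recursion $k = M-2, \ldots, 0$:
For each state $q$ ($0 \le q \le Q-1$)
\[ \beta_k(q) = \min_{0 \le q' \le Q-1} \left\{ \beta_{k+1}(q') + \tau_k(q', q) \right\}, \tag{13} \]
where the backward branch metric is
\[ \tau_k(q', q) = \left| \hat{y}_{M-2-k} - \sum_{j=M-2-k}^{M-1} R_{M-2-k,j} \cdot s_j \right|^2. \tag{14} \]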
The symbol $s_j$ in (14) is formed by using the forward path
array $\phi_k^q$ and the incoming state $q'$:
\[ s_j = \begin{cases} \mathrm{map}(q'), & j = M-2-k \\ \mathrm{map}\left(\phi_k^q(M-1-j)\right), & j > M-2-k \end{cases} \tag{15} \]
[Figure: 4-state trellis (states 0 to 3) over steps k=0 to k=3, with forward and backward survivor paths]
Fig. 3. Example of a 4-state trellis after the backward recursion.
Fig. 3 shows the trellis diagram after the backward
recursion. The dotted and the solid lines denote the survivor
paths for the backward recursion and the forward recursion,
respectively. The backward $\beta$ metric is an approximation of
the partial Euclidean distance (PED). Combining the forward
$\alpha$ metric and the backward $\beta$ metric gives a good
approximation of the minimum Euclidean distance for
each trellis node. After the backward recursion,
every state node has a fully
extended minimum path metric $\Omega$ that can be used to generate
the APP L-value. For a pipelined implementation, the APP
L-values can be generated in parallel with the backward
recursion, as shown in Fig. 6.
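A minimal Python sketch of the backward recursion in (12) to (15) follows. It takes the forward survivor paths as an input, using the same hypothetical layout as before (`phi[k][q]` is the list of states $[q_0, \ldots, q_k]$ ending in state $q$); the `QPSK` table and the function name are illustrative assumptions.

```python
QPSK = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]   # hypothetical map(q), q = 0..3

def backward_recursion(R, yhat, const, phi):
    """Backward pass of (12)-(15), given forward survivor paths phi[k][q]."""
    M, Q = len(yhat), len(const)
    beta = [[0.0] * Q for _ in range(M)]      # (12): beta at step M-1 is zero
    for k in range(M - 2, -1, -1):
        row = M - 2 - k                       # antenna visited at step k+1
        for q in range(Q):
            # Antennas j > row reuse the forward path of state q, as in (15).
            interf = sum(R[row][j] * const[phi[k][q][M - 1 - j]]
                         for j in range(row + 1, M))
            # (13) with (14): try every incoming state q' at step k+1.
            beta[k][q] = min(
                beta[k + 1][qp]
                + abs(yhat[row] - R[row][row] * const[qp] - interf) ** 2
                for qp in range(Q))
    return beta
```

Because $\beta_{k+1}(q')$ was accumulated along the forward history of $q'$ rather than that of $q$, the sum is an approximation of the remaining distance, which matches the description of $\beta$ above.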
B.2) LLR Generation $k = M-1, \ldots, 0$:
\[ \Omega_k(q) = \alpha_k(q) + \beta_k(q), \quad 0 \le q \le Q-1 \]
\[ L_D(x_{M-1-k,j}) = \frac{1}{\sigma^2} \left( \min_{q^{<j>} = -1} \Omega_k(q) - \min_{q^{<j>} = +1} \Omega_k(q) \right) \]
where $\Omega_k(q)$ is an approximation of the minimum Euclidean
distance obtained by combining the forward and backward metrics for
state $q$ at step $k$, $L_D(x_{M-1-k,j})$ is the APP L-value
for the $j$-th bit of the transmitted data vector $\mathbf{x}_{M-1-k}$, and
$q^{<j>} = \pm 1$ is the subset of $\{q\}$ ($q = 0, 1, \ldots, Q-1$) with its
$j$-th bit equal to $\pm 1$ ($0 \le j \le \log_2(Q)-1$).
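The LLR generation step can be sketched as below for one trellis step. The state labeling is an assumption made for illustration (bit $j$ of the integer state index $q$, with 0 read as −1 and 1 read as +1); the paper does not fix a labeling.

```python
def llr_from_metrics(alpha_k, beta_k, sigma2, mc):
    """LLRs for the mc bits of one antenna from Omega_k(q) = alpha + beta."""
    omega = [a + b for a, b in zip(alpha_k, beta_k)]
    llrs = []
    for j in range(mc):
        # Assumed labeling: bit j of q, where 0 means -1 and 1 means +1.
        min_minus = min(w for q, w in enumerate(omega) if ((q >> j) & 1) == 0)
        min_plus = min(w for q, w in enumerate(omega) if ((q >> j) & 1) == 1)
        llrs.append((min_minus - min_plus) / sigma2)
    return llrs
```

Since all $Q = 2^{M_c}$ states carry a metric, both minimizations always have candidates, so no LLR clipping value is needed in this sketch.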
IV. SIMULATION RESULTS
To evaluate the detection performance, we consider 4×4 16-
QAM and 64-QAM MIMO systems (the channel matrices are
assumed to have independent Rayleigh fading distribution). In
the simulation, the soft-output of the detector is fed to a length
2304, rate 1/2 LDPC decoder [6], which performs up to 15
iterations. Fig. 4 and Fig. 5 show the bit error rate performance
for the proposed F-B detector and the soft K-best complex
detector with different K values. For the 4×4 16-QAM system,
our F-B detector outperforms the K-best detector for K=16 and
32, and achieves similar performance compared with K=64.
The same trend has been observed for the 4 × 4 64-QAM
system where our F-B detector outperforms the soft K-best
detector with K=32, 48 and 64.
[Figure: BER (10^-8 to 10^-1) versus Eb/N0 (dB), 9.5 to 12 dB; plot title "4x4 16-QAM MIMO with LDPC (R=1/2, L=2304)"; curves: F-B trellis, K-best (K=16), K-best (K=32), K-best (K=64)]
Fig. 4. Simulation results for 4×4 16-QAM MIMO system.
[Figure: BER (10^-7 to 10^0) versus Eb/N0 (dB), 13 to 18 dB; plot title "4x4 64-QAM MIMO with LDPC (R=1/2, L=2304)"; curves: F-B trellis, K-best (K=32), K-best (K=48), K-best (K=64)]
Fig. 5. Simulation results for 4×4 64-QAM MIMO system.
V. ARCHITECTURE DESIGN
Fig. 6 shows the proposed architecture based on the F-B
algorithm. The proposed architecture is well suited to
VLSI implementation because it has a regular data flow, fixed
complexity, and fixed throughput and latency. Compared with
the soft K-best detector, our architecture has lower complexity
since no sorting is required; only finding the minimum value is
required. Therefore, the critical
path of the F-B detector would be shorter than that of the K-best
detector. The latency is also reduced.
A tile chart is used to represent the detector data flow, which
is shown in Fig. 7. The X-axis represents the MIMO symbol
sequences and the Y-axis represents the decoding time. In
Fig. 7, the forward recursion (F) is followed by the backward
recursion (B), and the LLRs are generated in parallel with the
backward recursion.
[Figure: block diagram with input memory ($\hat{\mathbf{y}}$, $\mathbf{R}$), α unit (×Q), Φ path memory, α memory, β unit (×Q) and LLR generator producing $L_D$]
Fig. 6. F-B detector architecture.
[Figure: tile chart with symbol index (1, 2, 3) on the X-axis and time on the Y-axis; tiles labeled F, B and LLR]
Fig. 7. Detection data flow tile chart.
A. Partial Euclidean Distance Computation Unit
The main computation units of this detector are the α and
β units. Since the α and β units have very similar structures,
we use node processing unit (NPU) hereinafter to represent
them. The NPU is responsible for calculating the partial Euclidean
distances (PEDs) and finding the minimum PED among all the
candidates. Because of the upper triangular property of the $\mathbf{R}$
matrix, the PED can be computed recursively as
\[ d_i = d_{i+1} + \left\| t_{i+1} + R_{i,i} S_i \right\|^2, \qquad t_{i+1} = \sum_{j=i+1}^{M-1} R_{i,j} S_j - \hat{y}_i, \tag{16} \]
where $d_M$ is initialized to 0, and $d_0$ is the full Euclidean
distance.
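The recursion in (16) can be checked numerically with a short Python sketch. The function name and the list-of-lists layout for the upper triangular matrix are assumptions for illustration.

```python
def recursive_ped(R, yhat, S):
    """Accumulate d_i from i = M-1 down to 0 following (16); returns d_0."""
    M = len(S)
    d = 0.0                                   # d_M = 0
    for i in range(M - 1, -1, -1):
        # t_{i+1}: contribution of already-chosen symbols minus the sample
        t = sum(R[i][j] * S[j] for j in range(i + 1, M)) - yhat[i]
        d += abs(t + R[i][i] * S[i]) ** 2
    return d
```

Row $i$ only needs the symbols $S_{i+1}, \ldots, S_{M-1}$, which is why each trellis step can extend the distance of a partial path without revisiting earlier rows.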
Since the trellis structure is fully connected, each state node
needs to compute $Q$ PEDs. Fig. 8 shows the architecture of
the partial Euclidean distance computation unit (PEU). For
simplicity, we assume a QPSK modulation scheme with $M$
transmit and receive antennas. In Fig. 8, SADD stands for
shift and add, which implements $R_{i,j} S_j$, and $C_x$ ($x =
0, 1, \ldots, Q-1$) are the constant constellation points. For the
QPSK scheme, $Q = 4$ new partial Euclidean distances
(NPEDs) are computed and sent to the compare and select
unit for further processing.
[Figure: PEU datapath built from SADD (shift and add) units, adders, $|\cdot|^2$ units and registers (D); inputs $R_{i,i}$, $R_{i,i+1}$, $R_{i,i+2}$, $R_{i,i+3}$, $S_{i+1}$, $S_{i+2}$, $S_{i+3}$, $\hat{y}_i$, constellation points $C_0$ to $C_3$ and PED; outputs NPED0 to NPED3]
Fig. 8. PEU architecture for QPSK system.
B. Compare and Select Unit
The compare and select unit (CSU), which is shown in
Fig. 9, is used to find the minimum PED from the $Q$ input NPEDs.
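One way to picture the compare and select operation is as a tree of two-input minimum comparators, sketched below in Python. The tree arrangement and the returned winner index are assumptions for illustration; the index is included because the path update in (11) needs the arg min, not only the minimum value.

```python
def compare_select(npeds):
    """Pairwise MIN tree over the candidates; returns (minimum, winner index)."""
    layer = [(value, index) for index, value in enumerate(npeds)]
    while len(layer) > 1:
        # One comparator per adjacent pair; ties go to the lower index.
        nxt = [min(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append(layer[-1])             # odd element passes through
        layer = nxt
    return layer[0]
```

For $Q = 4$ inputs this uses three comparisons in two levels, and in general $Q-1$ comparisons in $\lceil \log_2 Q \rceil$ levels, with no sorting network.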
[Figure: CSU built from MIN units with inputs IN0 to IN3]
Fig. 9. Compare and select architecture for QPSK system.
C. Node Processing Unit
In each step of the trellis algorithm, the $Q$ state nodes
work independently and can therefore be processed in
parallel. The top-level node processing unit (NPU) for QPSK
systems, which instantiates $Q = 4$ PEUs and CSUs, is shown
in Fig. 10. This is an iterative hardware architecture that
implements (16), and the latency for one iteration is 3 cycles.
D. Architecture for Higher order Modulation Systems
We have shown the hardware architecture for QPSK systems;
we now extend it to higher-order modulation
schemes such as 16-QAM. The PEU-E and CSU-E in Fig. 11
are extensions of the PEU and CSU obtained by replicating the
hardware four times to support a 16-QAM system. The PEU-E unit
computes 16 branch metrics, and the CSU-E unit
selects the minimum PED from 16 candidates.
Based on the PEU-E and CSU-E units, the node processing
unit (NPU) for a 16-QAM system has an architecture very
similar to that of the QPSK system. As shown in Fig. 12, 16 PEU-Es
[Figure: PEU0 to PEU3 (inputs R, S, $C_x$, $\hat{y}_i$ and PED0 to PED3) feed CSU0 to CSU3 through interconnects; PEU m outputs n_m0 to n_m3 and CSU n receives n_0n to n_3n; the CSU outputs PED0 to PED3 are registered (D)]
Fig. 10. NPU architecture for QPSK system.
[Figure: PEU-E built from four PEUs and CSU-E built from CSU blocks; inputs R, S, $C_x$ (x = 0 to 15), $\hat{y}_i$ and PED; output MIN]
Fig. 11. PEU and CSU architecture for 16-QAM system.
and CSU-Es are instantiated so that 16 nodes can be processed
in parallel. The latency for each iteration remains 3
cycles. Therefore, the throughput for a 16-QAM system is
twice that of the QPSK system.
[Figure: PEU-E 0 to PEU-E 15 (inputs R, S, $C_x$, $\hat{y}_i$ and PED0 to PED15) feed CSU-E 0 to CSU-E 15 through interconnects (signals n0_0 to n15_15); the outputs PED0 to PED15 are registered (D)]
Fig. 12. NPU architecture for 16-QAM system.
E. Hardware Complexity and Throughput Analysis
Table I shows the hardware complexity, detection throughput,
and latency analysis for 4×4 QPSK and 16-QAM systems.
The gate count estimation is based on a TSMC 65 nm standard
cell CMOS library. The highest clock frequency that the
detector can achieve is about 400 MHz. The decoding latency
for a 4×4 system is 3×M = 12 cycles.
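The throughput figures in Table I are consistent with a simple estimate: the number of coded bits per symbol vector, $M \cdot M_c$, divided by the time needed to detect one vector. The Python sketch below reproduces that arithmetic; the formula is an inference from the reported numbers (assuming one vector completes every 12 cycles at 400 MHz), not a statement taken from the paper.

```python
def detector_throughput_mbps(n_tx, bits_per_symbol, f_clk_mhz, latency_cycles):
    """Bits per symbol vector divided by the time to detect one vector."""
    return n_tx * bits_per_symbol * f_clk_mhz / latency_cycles

qpsk = detector_throughput_mbps(4, 2, 400, 12)    # roughly 266.7 Mbps
qam16 = detector_throughput_mbps(4, 4, 400, 12)   # roughly 533.3 Mbps
```

Under this estimate the 16-QAM detector doubles the QPSK throughput simply because each symbol carries twice as many bits while the cycle count per vector is unchanged.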
Table II compares the detection throughput and hardware
complexity of the proposed F-B solution with two hardware
implementations from the literature: the depth-first soft sphere
TABLE I
COMPLEXITY AND THROUGHPUT/LATENCY ANALYSIS
System        Gate count   Max throughput   Latency
4×4 QPSK      36 K         266 Mbps         12 cycles
4×4 16-QAM    576 K        533 Mbps         12 cycles
detector with 256 search operations (fclk = 122.88 MHz) from
[1], and the soft K-best detector (fclk = 200 MHz) from [4]. In [4],
a real QR decomposition is used with a small K=5. Based
on the simulation results in Fig. 4, our solution has better
BER performance than [4] and achieves a higher throughput
because it avoids sorting operations, which
are very expensive in hardware. On the
other hand, at a cost of more hardware resources, the depth-first
detector in [1] has better BER performance than our
solution. However, [1] has limited throughput because of
the large number of sequential search operations, and its
most undesirable feature is its variable throughput at
different SNR levels. Our architecture provides a good trade-off
between the depth-first detector and the K-best detector.
TABLE II
COMPARISON OF SOFT 4×4 16-QAM MIMO DETECTORS
              [1]          [4]         This work
Throughput    38.8 Mbps    106 Mbps    533 Mbps
Gate count    1100 K       97 K        576 K
VI. CONCLUSION
We propose a new MIMO detector architecture based on
the Forward-Backward recursion algorithm. This scheme can
achieve very high throughput and can be easily parallelized.
Both throughput and latency are deterministic; hence it is very
suitable for hardware implementation.
VII. ACKNOWLEDGEMENT
This work was supported in part by Nokia and by NSF under
grants CCF-0541363, CNS-0551692, and CNS-0619767.
REFERENCES
[1] D. Garrett, L. Davis, S. ten Brink, B. Hochwald, and G. Knagge, “Silicon Complexity for Maximum Likelihood MIMO Detection Using Spherical Decoding,” IEEE J. Solid-State Circuits, vol. 39, pp. 1544–1552, Sep 2004.
[2] P. Radosavljevic and J. R. Cavallaro, “Soft Sphere Detection with Bounded Search for High-Throughput MIMO Receivers,” in IEEE Asilomar Conf. on Signals, Syst. and Computers, Oct 2006, pp. 1175–1179.
[3] K. Wong, C. Tsui, R. Cheng, and W. Mow, “A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels,” in IEEE Int. Symp. on Circuits and Syst., vol. 3, May 2002, pp. 273–276.
[4] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE J. Selected Areas in Commun., vol. 24, pp. 491–503, Mar 2006.
[5] B. Hochwald and S. ten Brink, “Achieving Near-Capacity on a Multiple-Antenna Channel,” IEEE Trans. Commun., vol. 51, pp. 389–399, Mar 2003.
[6] Y. Sun, M. Karkooti, and J. R. Cavallaro, “VLSI Decoder Architecture for High Throughput, Variable Block-size and Multi-rate LDPC Codes,” in IEEE Int. Symp. on Circuits and Systems, May 2007, pp. 2104–2107.
