High Throughput VLSI Architecture for Soft-Output MIMO Detection Based on A Greedy Graph Algorithm by Sun, Yang & Cavallaro, Joseph R.
High Throughput VLSI Architecture for Soft-Output MIMO
Detection Based on A Greedy Graph Algorithm ∗
Yang Sun
Depart. of Electrical and Computer Engineering
Rice University
6100 Main Street, Houston, TX 77005
ysun@rice.edu
Joseph R. Cavallaro
Depart. of Electrical and Computer Engineering
Rice University
6100 Main Street, Houston, TX 77005
cavallar@rice.edu
ABSTRACT
Maximum-likelihood (ML) decoding is a computationally
intensive task for multiple-input multiple-output (MIMO)
wireless channel detection. This paper presents a new
graph-based algorithm that achieves near-ML performance for soft
MIMO detection. Instead of using the traditional tree-search-based
structure, we represent the search space of the MIMO
signals with a directed graph, and a greedy algorithm is applied
to compute the a posteriori probability (APP) for each
transmitted bit. The proposed detector has two advantages:
1) it keeps a fixed throughput and has a regular and parallel
datapath structure which makes it amenable to high speed
VLSI implementation, and 2) it attempts to maximize the a
posteriori probability by making the locally optimum choice
at each stage with the hope of finding the global minimum
Euclidean distance for every transmitted bit xk ∈ {−1,+1}.
Compared to the soft K-best detector, the proposed solution
significantly reduces the complexity because sorting is not
required, while still maintaining good bit error rate (BER)
performance. The proposed greedy detection algorithm has
been designed and synthesized for a 4 × 4 16-QAM MIMO
system in a TSMC 65 nm CMOS technology. The detector
achieves a maximum throughput of 600 Mbps with a 0.79 mm²
core area.
Categories and Subject Descriptors
B.7.1 [Types and Design Styles]: Algorithms implemented
in hardware
General Terms
Algorithm, Design, Performance
Keywords
MIMO Detection, VLSI Architecture, ASIC Design
∗This work was supported in part by Nokia Corporation,
and by NSF under grants CCF-0541363, CNS-0551692, and
CNS-0619767.
GLSVLSI’09, May 10–12, 2009, Boston, Massachusetts, USA.
Copyright 2009 ACM 978-1-60558-522-2/09/05 ...$5.00.
1. INTRODUCTION
Multiple-input multiple-output (MIMO) communication
systems have received tremendous attention because of their
high spectral efficiency and near-capacity performance. New
wireless standards, such as IEEE 802.11n, IEEE 802.16e
WiMax, and 3GPP LTE, include MIMO techniques in com-
bination with advanced outer channel codes such as low-
density parity-check (LDPC) codes [3] and Turbo codes [1].
The main challenge of soft MIMO detection is to efficiently
and accurately generate the log-likelihood ratios
(LLRs) for the outer channel decoder. An exhaustive search
under the ML criterion consumes enormous computing
power and silicon area, which makes it infeasible for
multiple-antenna systems with higher-order modulation schemes.
To reduce the exponential algorithmic complexity, several
sub-optimal detection algorithms and their VLSI architectures
have recently been proposed [4, 2, 10, 5, 7, 9].
Garrett et al. [4] implemented a depth-first soft
sphere decoding (SD) algorithm with 256 search operations
at each level of the tree. Burg et al. [2] presented a simpli-
fied hard sphere decoding ASIC architecture. On the other
hand, Wong et al. [10] first introduced the breadth-first
K-best hard detection algorithm. Later on, Guo et al. [5]
extended it for the soft K-best (K=5) detection by keeping
a list of best candidates at each search tree level.
The depth-first SD algorithm has non-deterministic com-
plexity and variable throughput which makes it sensitive to
the channel conditions. The depth-first SD with a small
candidate list size suffers significant performance degrada-
tion due to the inaccurate and especially the infinite log like-
lihood ratios (LLRs). On the other hand, the K-best algo-
rithm has advantages of fixed complexity and fixed through-
put, which makes it more friendly for hardware implementa-
tion. However, when K is large, the complexity of the K-best
algorithm dramatically increases because a large number of
paths have to be extended and sorted.
In this paper, a greedy shortest-path search algorithm
and its VLSI architecture are proposed for high-throughput
soft MIMO detection. We transform the traditional MIMO
detection problem into a shortest path finding problem. By
making a locally optimum choice at each stage, this algo-
rithm tries to find the global minimum Euclidean distance
for every transmitted bit. Therefore it avoids the LLR clip-
ping issues that both depth-first SD and K-best detectors
have. Moreover, this approach is very suitable for high speed
VLSI implementation because of the regular and parallel
datapath structure.
2. SYSTEM MODEL
We consider a coded MIMO system with M transmit an-
tennas and N receive antennas. The MIMO transmission
can be modeled as:
$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}, \tag{1}$$
where $\mathbf{H}$ is an $N \times M$ complex matrix, $\mathbf{s} = [s_0, \ldots, s_{M-1}]^T$
is an $M \times 1$ transmitted vector, $\mathbf{y}$ is an $N \times 1$ received
vector, and $\mathbf{n}$ is an independent identically distributed complex
zero-mean Gaussian noise vector with variance $\sigma^2$ per
direction. The symbol vector $\mathbf{s}$ is obtained using the mapping
function $s_m = \mathrm{map}(\mathbf{x}^{\langle m \rangle})$, $m = 0, \ldots, M-1$, where
$\mathbf{x}^{\langle m \rangle} = [x_0, x_1, \ldots, x_{M_c-1}]^T$ is an $M_c \times 1$ vector (block)
of transmitted data bits, and $M_c$ is the number of
bits per constellation symbol. For instance, $M_c = 2$ and
$s \in \{1+j, 1-j, -1+j, -1-j\}$ in the case of QPSK, and $M_c = 4$ and
$s \in \{\pm 1 \pm j, \pm 1 \pm 3j, \pm 3 \pm j, \pm 3 \pm 3j\}$ in the case of 16-QAM.
The concatenation of $\mathbf{x}^{\langle 0 \rangle}, \ldots, \mathbf{x}^{\langle M-1 \rangle}$ is denoted by $\mathbf{x}$. The
soft-output detector computes the APP value of each bit
$x_k$, for $k = 0, \ldots, M \cdot M_c - 1$. The APP is usually expressed
as a log-likelihood ratio value (L-value):
$$L(x_k \mid \mathbf{y}) = \ln \frac{P[x_k = +1 \mid \mathbf{y}]}{P[x_k = -1 \mid \mathbf{y}]} = L_A(x_k) + L_E(x_k), \tag{2}$$
where LA and LE denote the a priori L-value and extrinsic
L-value, respectively. Assuming there is no prior knowledge
of the transmitted signal, using the max-log approximation
[6], (2) can be simplified to
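$$L(x_k \mid \mathbf{y}) \approx \frac{1}{2\sigma^2} \left( \min_{\mathbf{x} \in X_{k,-1}} \Lambda(\mathbf{s}, \mathbf{y}) - \min_{\mathbf{x} \in X_{k,+1}} \Lambda(\mathbf{s}, \mathbf{y}) \right), \tag{3}$$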
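where set $X_{k,+1} = \{\mathbf{x} \mid x_k = +1\}$ and set $X_{k,-1} = \{\mathbf{x} \mid x_k = -1\}$.
Using the QR decomposition $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where
$\mathbf{Q}$ and $\mathbf{R}$ refer to an $N \times M$ unitary matrix and an $M \times M$
upper triangular matrix, respectively, $\Lambda(\mathbf{s}, \mathbf{y})$ can be computed as
$$\Lambda(\mathbf{s}, \mathbf{y}) = \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 = \|\hat{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2 + C, \tag{4}$$
where $\hat{\mathbf{y}} = \mathbf{Q}^H \mathbf{y}$, and $C$ is a constant ($C = 0$ if $M = N$),
which does not affect (3).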
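As a reference point for the complexity discussion, the max-log rule (3) can be evaluated by brute force on a toy system. The sketch below is illustrative only: the bit-to-symbol mapping `map_qpsk` (symbol = x0 + j·x1), the 2 × 2 channel values, and the function names are assumptions made for this example, not taken from the paper.

```python
from itertools import product


def map_qpsk(bits):
    """Hypothetical QPSK mapping (assumed): bits (x0, x1) in {-1,+1} -> x0 + j*x1."""
    return complex(bits[0], bits[1])


def euclid(H, s, y):
    """Lambda(s, y) = ||y - H s||^2, with H given as a list of rows."""
    return sum(abs(y[r] - sum(H[r][c] * s[c] for c in range(len(s)))) ** 2
               for r in range(len(y)))


def maxlog_llr(H, y, sigma2, M, Mc=2):
    """Exhaustive max-log L-values per (3); visits all 2^(M*Mc) bit vectors."""
    n_bits = M * Mc
    best = {(k, b): float("inf") for k in range(n_bits) for b in (-1, +1)}
    for x in product((-1, +1), repeat=n_bits):
        s = [map_qpsk(x[m * Mc:(m + 1) * Mc]) for m in range(M)]
        lam = euclid(H, s, y)
        for k, b in enumerate(x):
            best[(k, b)] = min(best[(k, b)], lam)
    return [(best[(k, -1)] - best[(k, +1)]) / (2 * sigma2) for k in range(n_bits)]


# Noise-free 2x2 example with arbitrary channel values.
H = [[1 + 0.5j, 0.3 - 0.2j], [0.1 + 0.4j, 0.9 - 0.1j]]
x_sent = (+1, -1, -1, +1)
s_sent = [map_qpsk(x_sent[0:2]), map_qpsk(x_sent[2:4])]
y = [sum(H[r][c] * s_sent[c] for c in range(2)) for r in range(2)]
llr = maxlog_llr(H, y, sigma2=0.5, M=2)
```

In this noise-free case the sign of each L-value matches the transmitted bit. The loop over all bit vectors is what becomes impractical for larger antenna counts and constellations.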
3. GREEDY DETECTION ALGORITHM
Without loss of generality, we use a 4 × 4 QPSK system
to explain our proposed algorithm in this section.
3.1 Graph construction
The goal of the soft MIMO detector is to generate the
LLR value for each transmitted bit xk based on (3), which
requires the calculation of the minimum Euclidean distance
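$$\Lambda = \left\| \begin{bmatrix} \hat{y}_0 \\ \hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \end{bmatrix} - \begin{bmatrix} R_{00} & R_{01} & R_{02} & R_{03} \\ 0 & R_{11} & R_{12} & R_{13} \\ 0 & 0 & R_{22} & R_{23} \\ 0 & 0 & 0 & R_{33} \end{bmatrix} \begin{bmatrix} s_0 \\ s_1 \\ s_2 \\ s_3 \end{bmatrix} \right\|^2, \tag{5}$$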
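over sets $X_{k,+1}$ and $X_{k,-1}$. The calculation of $\Lambda$ can be
decomposed as $\Lambda = w^{\langle 0 \rangle} + w^{\langle 1 \rangle} + w^{\langle 2 \rangle} + w^{\langle 3 \rangle}$, where
$w^{\langle t \rangle}$ is the 1-D Euclidean distance and is calculated as
$$\begin{aligned}
w^{\langle 0 \rangle} &= \|\hat{y}_3 - R_{33}s_3\|^2, \\
w^{\langle 1 \rangle} &= \|\hat{y}_2 - (R_{22}s_2 + R_{23}s_3)\|^2, \\
w^{\langle 2 \rangle} &= \|\hat{y}_1 - (R_{11}s_1 + R_{12}s_2 + R_{13}s_3)\|^2, \\
w^{\langle 3 \rangle} &= \|\hat{y}_0 - (R_{00}s_0 + R_{01}s_1 + R_{02}s_2 + R_{03}s_3)\|^2.
\end{aligned} \tag{6}$$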
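The decomposition in (6) can be checked numerically. The following is a minimal sketch, assuming an arbitrary upper triangular `R`, received vector `yhat`, and symbol vector `s` chosen purely for illustration; `stage_weights` and `full_distance` are hypothetical helper names.

```python
def stage_weights(R, yhat, s):
    """Return [w<0>, ..., w<M-1>] per (6); stage t uses row i = M-1-t of R."""
    M = len(s)
    weights = []
    for t in range(M):
        i = M - 1 - t
        # Only antennas j >= i contribute, because R is upper triangular.
        partial = sum(R[i][j] * s[j] for j in range(i, M))
        weights.append(abs(yhat[i] - partial) ** 2)
    return weights


def full_distance(R, yhat, s):
    """||yhat - R s||^2 computed row by row over the full matrix."""
    M = len(s)
    return sum(abs(yhat[i] - sum(R[i][j] * s[j] for j in range(M))) ** 2
               for i in range(M))


# Illustrative values only (not from the paper).
R = [[2.0, 0.3 + 0.1j, -0.2j, 0.1],
     [0, 1.5, 0.2 - 0.3j, 0.4j],
     [0, 0, 1.2, -0.1 + 0.2j],
     [0, 0, 0, 0.9]]
yhat = [0.5 - 1j, 1.2 + 0.3j, -0.7j, 0.4 + 0.4j]
s = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]

w = stage_weights(R, yhat, s)
total = full_distance(R, yhat, s)
```

The sum of the four stage weights equals the full distance up to floating-point rounding, and `w[0]` depends only on `s[3]`, which is the property the graph construction relies on.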
This process can be viewed as a flow graph, which is shown
in Figure 1. The vertices are ordered into 4 vertical slices or
stages. Stages 0, 1, 2, 3 represent antennas 3, 2, 1, 0,
respectively. In each stage, there are $Q = 2^{M_c}$ vertices. Each
vertex represents a possible complex constellation point. Each
vertex at each stage is connected to $Q$ vertices at the preceding
stage and $Q$ vertices at the following stage. The vertex in stage $t$ is
represented as $v(t, i)$ ($0 \le i \le Q-1$). The edge between
$v(t-1, i)$ and $v(t, j)$ has a weight of $w^{\langle t \rangle}_{i,j}$. Because of the
upper triangular property of $\mathbf{R}$, from (6) we know that the
weight function does not depend on the future stages, but
only on its current stage and all its predecessors.
For instance, $w^{\langle 0 \rangle}_{i,j}$ only depends on the vertices in stage
0, while $w^{\langle 2 \rangle}_{i,j}$ depends on the vertices in stages 2, 1, and 0.
The edges connected to the dummy terminus node (the toor) have 0
weights.
[Figure 1: Flow graph for MIMO detection. Between the root and the toor, constellation-point vertices 0 to 3 are arranged in Stage 0 / Slice 0 (Antenna 3), Stage 1 / Slice 1 (Antenna 2), Stage 2 / Slice 2 (Antenna 1), and Stage 3 / Slice 3 (Antenna 0); edges carry the weights $w^{\langle t \rangle}_{i,j}$, and edges into the toor have weight 0.]
3.2 Problem definition
Given a received, possibly noisy MIMO symbol, we may
associate the 1-D Euclidean distance with weights on each
edge in the graph so that the problem of ML detection re-
duces to the problem of finding the minimum-weight path
from the root to the toor in the graph.
Definition 1: Hard MIMO detection problem:
Find the shortest path from the root to the toor. Then the
encountered vertices are the detected MIMO signals.
Definition 2: Soft MIMO detection problem:
For each vertex $i$ ($0 \le i \le Q-1$) in stage $t$ ($0 \le
t \le M-1$), find the shortest path from the root to the toor
that contains this vertex. The $Q$ conditional
shortest paths found at stage $t$ form a candidate list
$L_t$. Then the L-value of bit $x^{\langle t \rangle}_i$ is calculated as:
$$L(x^{\langle t \rangle}_i \mid \mathbf{y}) = \frac{1}{2\sigma^2} \left( \min_{\mathbf{x} \in L_{t,-1}} \Lambda - \min_{\mathbf{x} \in L_{t,+1}} \Lambda \right). \tag{7}$$
3.3 Greedy algorithm
The optimum solution to the hard detection problem re-
quires full search over the entire graph whose complexity
grows exponentially with the size of the graph. Solving the
soft detection problem is even harder since it needs to re-
peatedly perform full search on the condition of every vertex
being included in this shortest path. In this section, we will
introduce a greedy shortest path algorithm to approximately
solve the soft detection problem. Like the K-best algorithm,
it makes decisions on the basis of the information at hand without
worrying about the effect these decisions may have in
the future.
Step 1. Edge reduction
In Figure 1, each vertex $i$ at each stage $t$ (except for the
first and last stages) is connected to $Q$ vertices at an earlier
stage and $Q$ vertices at a later stage. Figure 2 shows the
data flow at each vertex, which has $Q$ incoming subpaths
$h_0, \ldots, h_{Q-1}$ and $Q$ outgoing subpaths $h'_0, \ldots, h'_{Q-1}$.
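[Figure 2: Data flow at vertex v(t, i). Incoming subpaths $h_0, \ldots, h_{Q-1}$ with distances $d_0, \ldots, d_{Q-1}$ enter vertex $i$ at stage $t$; the best incoming subpath $(h_m, d_m)$ is kept, and outgoing subpaths $h'_0, \ldots, h'_{Q-1}$ with distances $d'_0, \ldots, d'_{Q-1}$ leave it.]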
To reduce the arithmetic complexity, a greedy algorithm
is summarized as follows. Let the partial distance be dk,
which is the cumulative weight of the subpath hk from the
root to this vertex i. Among the Q incoming subpaths, we
select the best subpath hm with the minimum weight
$$m = \arg\min_{m \in \{0, \ldots, Q-1\}} d_m, \tag{8}$$
and discard the other $Q-1$ subpaths. After choosing the
best subpath $h_m$, the outgoing subpath to vertex $v(t+1, k)$
is updated as
$$h'_k = \{h_m, k\}, \quad 0 \le k \le Q-1. \tag{9}$$
The outgoing path weight to vertex $v(t+1, k)$ is updated as
$$d'_k = d_m + w^{\langle t+1 \rangle}_{i,k}, \quad 0 \le k \le Q-1, \tag{10}$$
where the weight function $w^{\langle t+1 \rangle}_{i,k}$ is calculated based on the
vertices along the subpath $h'_k$ according to (6). Moreover,
among the $Q$ outgoing subpaths we also find the shortest
subpath $h'_n$, where
$$n = \arg\min_{n \in \{0, \ldots, Q-1\}} d'_n. \tag{11}$$
This information will be stored in memory for later use.
Figure 3 shows an example of the resulting graph after applying
the edge reduction operation. Note that only the surviving
path for each vertex is shown in Figure 3. We can see that
each vertex in stage 3 is along a path to the end point (toor).
These paths are presumably the shortest and can be used to
form the candidate list L3. However not every vertex in the
stages 0, 1, and 2 is along a path to the end point. To solve
this issue, we need to perform a path extension operation.
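The edge reduction step described by (8) to (11) can be sketched in software as a single forward pass over the stages. This is an illustrative sketch rather than the authors' implementation: the data layout (`path[t]` is the vertex index chosen at stage t), the helper names, and the example `R`, constellation, and noise-free `yhat` are all assumptions.

```python
def edge_weight(R, yhat, const, path):
    """Weight of the last edge on `path`, where path[t] is the vertex index at stage t."""
    M = len(yhat)
    t = len(path) - 1
    i = M - 1 - t  # stage t handles antenna i = M-1-t
    acc = sum(R[i][j] * const[path[M - 1 - j]] for j in range(i, M))
    return abs(yhat[i] - acc) ** 2


def edge_reduction(R, yhat, const):
    """Return (survivors, best_out) after the forward edge-reduction pass."""
    M, Q = len(yhat), len(const)
    by_dist = lambda pd: pd[1]
    # Stage 0: a single edge from the root reaches each vertex.
    survivors = [[([k], edge_weight(R, yhat, const, [k])) for k in range(Q)]]
    best_out = []
    for t in range(M - 1):
        # outgoing[i][k]: survivor of v(t, i) extended to v(t+1, k), as in (9)-(10).
        outgoing = [[(p + [k], d + edge_weight(R, yhat, const, p + [k]))
                     for k in range(Q)] for (p, d) in survivors[t]]
        # (11): shortest outgoing subpath per vertex, kept for path extension.
        best_out.append([min(row, key=by_dist) for row in outgoing])
        # (8): each vertex of stage t+1 keeps only its best incoming subpath.
        survivors.append([min((outgoing[i][k] for i in range(Q)), key=by_dist)
                          for k in range(Q)])
    return survivors, best_out


# Illustrative 4x4 QPSK example, noise-free (values are assumptions).
const = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[2.0, 0.3 + 0.1j, -0.2j, 0.1],
     [0, 1.5, 0.2 - 0.3j, 0.4j],
     [0, 0, 1.2, -0.1 + 0.2j],
     [0, 0, 0, 0.9]]
s_true = [const[2], const[0], const[3], const[1]]  # antennas 0..3
yhat = [sum(R[i][j] * s_true[j] for j in range(4)) for i in range(4)]
survivors, best_out = edge_reduction(R, yhat, const)
best_path, best_dist = min(survivors[-1], key=lambda pd: pd[1])
```

Here `survivors[t][k]` holds the surviving subpath ending at v(t, k), and `best_out[t][i]` plays the role of the information stored in memory for later use. Every vertex of the last stage ends a complete path, which corresponds to the candidate list for stage 3.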
Step 2. Path extension
A. Path extension for stage 2
In Figure 3, if we look at the vertices in stage 2, we
see that not every vertex is along a path to the toor. For
example, vertices 0 and 3 are disconnected from the toor.
Therefore, we need to extend those incomplete paths. The
path extension algorithm is summarized as follows. Extend
each subpath by checking all its $Q$ outgoing edges and
select the edge which leads to the shortest cumulative subpath
weight. This process is repeated in a greedy manner until
it reaches the toor. Figure 4 illustrates a one-level path
extension operation for $v(t, i)$.
[Figure 3: Edge reduction result. Surviving paths through vertices 0 to 3 of Stages 0 to 3, between the root and the toor.]
[Figure 4: Path extension operation: select the best outgoing subpath. At vertex $i$ of stage $t$, the incoming subpath is extended along the best outgoing subpath.]
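The path extension rule can be sketched as a short loop that keeps appending the cheapest outgoing edge until the path is complete. This is a sketch under assumptions: it recomputes every level for simplicity, whereas the paper retrieves the first level from memory; the helper names and example values are hypothetical.

```python
def edge_weight(R, yhat, const, path):
    """Weight of the last edge on `path`, where path[t] is the vertex index at stage t."""
    M = len(yhat)
    t = len(path) - 1
    i = M - 1 - t  # stage t handles antenna i = M-1-t
    acc = sum(R[i][j] * const[path[M - 1 - j]] for j in range(i, M))
    return abs(yhat[i] - acc) ** 2


def extend_greedily(R, yhat, const, path, dist):
    """Extend a subpath to the last stage, taking the cheapest outgoing edge each time."""
    M = len(yhat)
    path = list(path)
    while len(path) < M:
        # Check all Q outgoing edges and keep the smallest cumulative weight.
        cands = [(dist + edge_weight(R, yhat, const, path + [k]), k)
                 for k in range(len(const))]
        dist, k = min(cands)
        path.append(k)
    return path, dist


# Illustrative 4x4 QPSK example, noise-free (values are assumptions).
const = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
R = [[2.0, 0.3 + 0.1j, -0.2j, 0.1],
     [0, 1.5, 0.2 - 0.3j, 0.4j],
     [0, 0, 1.2, -0.1 + 0.2j],
     [0, 0, 0, 0.9]]
s_true = [const[2], const[0], const[3], const[1]]  # antennas 0..3
yhat = [sum(R[i][j] * s_true[j] for j in range(4)) for i in range(4)]

# Extend the stage-0 subpath that starts at vertex 0 (not the transmitted one).
start = [0]
full_path, full_dist = extend_greedily(R, yhat, const, start,
                                       edge_weight(R, yhat, const, start))
```

The returned path always keeps the given starting vertices, so each vertex of a stage ends up with one complete candidate path, as required by the soft detection problem.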
[Figure 5: Path extension result for stage 2.]
Recall that in the edge reduction step, we saved the
shortest outgoing subpath $h'_n$ for each vertex into memory.
Retrieving this information for each vertex in stage 2 yields
the graph shown in Figure 5. Here the dotted lines represent
the outgoing edges retrieved from the memory. Now each
vertex in stage 2 is along a path to the toor (presumably
the shortest). So the candidate list L2 can be created in
this way.
B. Path extension for stage 1
Similarly, in Figure 3, not every vertex in stage 1 is along a
path to the toor. By retrieving the shortest outgoing sub-
path from memory for each vertex in the stage 1, we could
extend these subpaths for one level as shown in Figure 6(a),
where the four extended subpaths are labeled as A, B, C,
and D. It is worth mentioning that the first level of the
path extension operation requires only a memory read and
no re-computation. However, the second-level path extension
operation requires re-computing and comparing the subpath
weights as described in the path extension algorithm. The
result after applying a second level path extension is shown
in Figure 6(b). Now the subpaths {A, B, C, D} have been
fully extended and can be used to form the list L1.
[Figure 6: Path extension for stage 1. (a) Step 1: Retrieve paths from memory. (b) Step 2: Re-computing the best path for the next stage. The extended subpaths are labeled A, B, C, and D.]
C. Path extension for stage 0
Similarly, by applying the path extension algorithm repeat-
edly, Figure 7 shows the path extension result for stage 0.
[Figure 7: Path extension result for stage 0. The extended subpaths are labeled A, B, C, and D.]
Step 3. LLR calculation
After all the candidate lists Lt for 0 ≤ t ≤ M − 1 have
been created, the LLR calculation defined in (7) is then very
straightforward.
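As an illustration of (7), the sketch below computes the L-values for the bits of one stage from the total weights of its Q candidate paths. The bit labeling in `vertex_bits` (binary index, most significant bit first, set bit means +1) and the example distances are assumptions made for this example; the paper does not specify a labeling.

```python
def vertex_bits(index, Mc):
    """Hypothetical labeling (assumed): bit b of `index`, MSB first; set -> +1, clear -> -1."""
    return [+1 if (index >> (Mc - 1 - b)) & 1 else -1 for b in range(Mc)]


def stage_llrs(distances, sigma2, Mc):
    """L-values per (7); distances[v] is the weight of the candidate path through vertex v."""
    llrs = []
    for b in range(Mc):
        d_minus = min(d for v, d in enumerate(distances) if vertex_bits(v, Mc)[b] == -1)
        d_plus = min(d for v, d in enumerate(distances) if vertex_bits(v, Mc)[b] == +1)
        llrs.append((d_minus - d_plus) / (2 * sigma2))
    return llrs


# Q = 4 candidate path weights for one stage (made-up numbers).
llrs = stage_llrs([0.2, 1.5, 0.9, 2.4], sigma2=0.5, Mc=2)
```

Because the candidate list contains one path through every vertex of the stage, both bit hypotheses are always represented and each minimum in (7) is finite.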
3.4 Algorithm complexity analysis
In the proposed greedy algorithm, calculating the partial
Euclidean distance is the major contribution to the total
arithmetic complexity. We only consider this part in the
complexity analysis and ignore the other minor contribu-
tors such as minimization and memory operations. More
generally, consider an $M$ transmit antenna system with
$Q$-size QAM modulation. In the edge reduction operation, the
number of subpath weights that need to be calculated per
stage is $Q^2$, so the total complexity is $O(MQ^2)$. On the
other hand, the complexity of the path extension operation
is $O(\frac{1}{2}(M-1)(M-2)Q^2)$. In a practical MIMO system,
$M$ is usually not very large. In the case of $M = 4$, the total
arithmetic complexity of this greedy algorithm is approximately
$O(7Q^2)$.
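The counts quoted above can be tabulated with a few lines of code. This sketch simply evaluates the two expressions from the text and, for comparison, the number of symbol vectors an exhaustive search would visit; the function name is hypothetical.

```python
def weight_evaluations(M, Q):
    """Subpath-weight counts from the complexity analysis (leading terms only)."""
    edge_reduction = M * Q * Q                        # Q^2 weights per stage, M stages
    path_extension = (M - 1) * (M - 2) // 2 * Q * Q   # (1/2)(M-1)(M-2) Q^2
    return edge_reduction, path_extension, edge_reduction + path_extension


for Q in (4, 16, 64):
    er, pe, total = weight_evaluations(4, Q)
    exhaustive = Q ** 4                               # candidate symbol vectors for M = 4
    print(f"Q={Q}: edge reduction {er}, path extension {pe}, "
          f"total {total} (= 7*Q^2), exhaustive candidates {exhaustive}")
```

For M = 4 and 16-QAM this gives 1024 + 768 = 1792 weight evaluations, against 65536 candidate vectors for an exhaustive search.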
4. SIMULATION RESULTS
Like the K-best algorithm, the proposed greedy algorithm
has a deterministic complexity and a fixed throughput. To
evaluate the decoding BER performance, we have compared
the proposed greedy algorithm with the traditional K-best
algorithm. We consider 4× 4 16-QAM and 64-QAM MIMO
systems (the channel matrices are assumed to have inde-
pendent Rayleigh fading distribution). In the simulation,
the soft-output of the detector is fed to a length 2304, rate
1/2 WiMax LDPC decoder [8], which performs up to 15 it-
erations. Figure 8 compares the BER performance for the
proposed MIMO detector. For the 4 × 4 16-QAM system,
our detector outperforms the K-best detector for K=16 and
32, and achieves similar performance compared with K=64.
For the 4×4 64-QAM system, our detector outperforms the
K-best detector with K=32, 48 and 64.
[Figure 8: BER performance comparison. Two panels plot BER (log scale) versus Eb/N0 (dB). Panel "4x4 16-QAM MIMO with LDPC (R=1/2, L=2304)": Eb/N0 from 9.5 to 12 dB, curves for Greedy and K-best (K=16, 32, 64). Panel "4x4 64-QAM MIMO with LDPC (R=1/2, L=2304)": Eb/N0 from 13 to 18 dB, curves for Greedy and K-best (K=32, 48, 64).]
5. VLSI ARCHITECTURE DESIGN
In this section, we will describe the hardware architecture
design for a 4 × 4 16-QAM MIMO system. The greedy
algorithm introduced in Section 3 is generic and can be easily
extended to higher-order modulation systems; e.g., in the
case of 16-QAM, there are 16 vertices instead of 4 at each
stage. Figure 9 shows the top level hardware architecture
for implementing the proposed greedy detection algorithm.
It contains four major units: Edge Reduction Unit (ERU),
Memory Module, Path Extension Unit (PEU), and LLR Cal-
culation Unit (LCU).
[Figure 9: Top level architecture. Blocks: Edge Reduction Unit (ERU), Memory Module, Path Extension Unit (PEU), and LLR Calculation Unit (LCU); inputs $\hat{\mathbf{y}}$ and $\mathbf{R}$, output LLR.]
5.1 Edge reduction unit (ERU)
We define the subpath metric (SM) for vertex v(t, i) as the
cumulative path weight (or partial Euclidean distance) up to
this vertex v(t, i). In the step of edge reduction, each vertex
will compare the Q incoming SMs and select the edge with
the minimum SM, and prune the other Q−1 edges. Then the
Q outgoing SMs are computed by adding the corresponding
edge weight to the surviving incoming subpath weight and
sent to the downstream vertices.
[Figure 10: ERU architecture. Sixteen parallel slices, each with a Vertex Processing Unit (VPU 0 to VPU 15) and two compare-and-select units (CS-A 0 to CS-A 15, CS-B 0 to CS-B 15), connected through interconnects; each VPU takes $\hat{\mathbf{y}}$ and $\mathbf{R}$ as inputs, CS-A returns $d_m$ to its VPU, and CS-B outputs go to memory.]
Figure 10 illustrates the ERU architecture, where VPU
stands for Vertex Processing Unit, and CS stands for Compare
and Select. The architecture is partially parallel, processing
$Q$ vertices simultaneously, and recursive, reusing the same
logic for different stages. Note there are two CS units: CS-A
and CS-B. The CS-A unit selects the minimum incoming SM and
passes the survivor to the VPU in the next iteration. The CS-B
unit selects the shortest outgoing SM for each vertex and
saves it to memory for the path extension operation. Figure 11
and Figure 12 show the VPU and CS architectures, respectively.
The VPU calculates the outgoing SMs
(partial Euclidean distances). At stage $t$ ($-1 \le t \le M-2$),
let $i = M-2-t$; the $k$-th ($0 \le k \le Q-1$) outgoing SM is
updated as:
$$d^{\langle t+1 \rangle}_k = d^{\langle t \rangle}_m + \|T + R_{i,i} S_i\|^2, \qquad T = \sum_{j=i+1}^{M-1} R_{i,j} S_j - \hat{y}_i, \tag{12}$$
where $S_i$ is the complex constellation point for antenna $i$; the
antennas are numbered 3, 2, 1, and 0, which correspond to
stages 0, 1, 2, and 3, respectively. The values of $S_{i+1}, \ldots, S_{M-1}$
are obtained from the vertices encountered in the incoming
subpath $h_m$. The incoming subpath metric $d^{\langle t \rangle}_m$ is
generated by the CS-A unit ($d^{\langle -1 \rangle}_m$ is initialized to 0). The
SADD in Figure 11 stands for shift and add, which is used
for implementing $R * S$.
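A behavioral sketch of the VPU update in (12) is given below: the interference term T is computed once from the survivor symbols and then reused for all Q constellation points. This models only the arithmetic, not the shift-and-add hardware; the function name, the dictionary `S_later`, and the numeric values are assumptions for illustration.

```python
def vpu_outgoing_sms(d_m, R_row, yhat_i, S_later, const, i):
    """Outgoing SMs per (12). S_later[j] holds S_j for each antenna j > i."""
    M = len(R_row)
    # T is shared by all Q outputs: interference from later antennas minus yhat_i.
    T = sum(R_row[j] * S_later[j] for j in range(i + 1, M)) - yhat_i
    return [d_m + abs(T + R_row[i] * S_i) ** 2 for S_i in const]


# 16-QAM constellation and made-up inputs for antenna i = 1 of a 4-antenna system.
const16 = [complex(a, b) for a in (-3, -1, 1, 3) for b in (-3, -1, 1, 3)]
R_row = [0, 1.5, 0.2 - 0.3j, 0.4j]      # row i = 1 of an upper triangular R
S_later = {2: 1 - 1j, 3: -3 + 1j}       # symbols along the surviving incoming subpath
sms = vpu_outgoing_sms(d_m=1.25, R_row=R_row, yhat_i=0.7 - 0.2j,
                       S_later=S_later, const=const16, i=1)
```

Computing T once per vertex and adding the Q precomputed products of $R_{i,i}$ with the constellation points mirrors the structure suggested by Figure 11.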
5.2 Path extension unit (PEU)
The function of the PEU is to extend the subpaths, which
were obtained in the edge reduction step, in a greedy and re-
cursive fashion. Both memory read and path re-computation
are required for stage 0 and 1. Only memory read is required
for stage 2. Figure 13 shows the PEU architecture, which
contains 16 VPUs and 16 CSs such that 16 subpaths can be
extended in parallel. The initial subpath metric $d_m$ is read
from the memory, which contains the results from the edge
reduction operation.
[Figure 11: VPU architecture. SADD blocks multiply $R_{i,i+1}$, $R_{i,i+2}$, $R_{i,i+3}$ by $S_{i+1}$, $S_{i+2}$, $S_{i+3}$; their sum minus $\hat{y}_i$ forms $T$; an SADD network supplies $R_{i,i}(1+j)$, $R_{i,i}(1-j)$, ..., $R_{i,i}(3+3j)$; adders, $|\cdot|^2$ blocks, and registers (D) combine these with $d_m$ to produce $d_0, \ldots, d_{15}$.]
[Figure 12: CS architecture. A tree of 15 MIN units reduces inputs 0 to 15 to the minimum $d_m$, which is registered (D).]
[Figure 13: PEU architecture. Sixteen VPU/CS pairs (0 to 15); each VPU reads $d_m$ from memory and produces $d_0, \ldots, d_{15}$ for its CS.]
5.3 LLR calculation unit (LCU)
After obtaining the list Lt, for 0 ≤ t ≤ M − 1, the LCU
implements (7) in a straightforward way. The detailed
implementation is omitted.
5.4 Hardware scheduling
Figure 14 shows the timing diagram for a 4× 4 16-QAM
MIMO system. Each stage of the edge reduction takes three
cycles to finish, so it takes 3 × 4 = 12 cycles to perform
the edge reduction operation. The path extension operation
for antenna 3 can start 6 cycles later and will take 6 cycles
to finish. The path extension operation for antenna 2 will
take another 3 cycles. So the total latency is 15 cycles.
Taking into account the two extra cycles of latency for LLR
generation, the total decoding latency for one MIMO symbol
is 17 cycles. In terms of throughput, because two consecutive
MIMO symbols can be overlapped as shown in Figure 14,
the decoding throughput for a 4×4 16-QAM MIMO system
is
$$\frac{M \times M_c \times f_{clk}}{\text{Cycle count}} = \frac{4 \times 4 \times f_{clk}}{12} = \frac{4}{3} f_{clk}. \tag{13}$$
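The arithmetic in (13) can be reproduced directly. The sketch below evaluates the throughput and the 17-cycle latency at a given clock frequency; the function names are hypothetical, and the 450 MHz figure is the post-layout clock reported in Section 6.

```python
def throughput_bps(M, Mc, f_clk_hz, cycles_per_symbol):
    """Decoding throughput per (13): M*Mc bits every `cycles_per_symbol` cycles."""
    return M * Mc * f_clk_hz / cycles_per_symbol


def latency_ns(cycles, f_clk_hz):
    """Latency of `cycles` clock cycles, in nanoseconds."""
    return cycles / f_clk_hz * 1e9


f_clk = 450e6                                  # 450 MHz
rate = throughput_bps(M=4, Mc=4, f_clk_hz=f_clk, cycles_per_symbol=12)
delay = latency_ns(17, f_clk)
print(f"throughput = {rate / 1e6:.0f} Mbps, latency = {delay:.1f} ns")
```

At 450 MHz this gives 600 Mbps, matching the reported maximum throughput, and a decoding latency of about 37.8 ns per MIMO symbol.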
[Figure 14: Timing diagram for a 4 × 4 16-QAM MIMO detection. Edge reduction processes S3, S2, S1, S0 of symbol k over cycles 0 to 12 and then continues with symbol k+1; path extension for S3 and S2 of each symbol runs in parallel; latency = 15 cycles.]
6. VLSI IMPLEMENTATION RESULT
A 4×4 16-QAM soft MIMO detector has been synthesized
(using Synopsys Design Compiler), placed and routed (using
Cadence SoC Encounter) for a TSMC 65nm CMOS technol-
ogy. Figure 15 shows the VLSI layout view of the proposed
MIMO detector. The fixed-point precision for $\mathbf{R}$ and $\hat{\mathbf{y}}$
is 10 bits, and the LLR outputs are represented with 7 bits. Based
on the fixed-point simulation results, the finite word-length
implementation leads to negligible performance degradation
compared with the floating-point representation. The maximum
achievable clock frequency is 450 MHz based on the post-
layout simulation. The corresponding maximum throughput
is 600 Mbps.
[Figure 15: VLSI layout photo, showing the ERU, PEU, LCU, and MEM blocks.]
Table 1 compares the detection throughput and hardware
complexity of the proposed detector versus two state-of-the-
art detectors from the literature: depth-first soft sphere de-
tector with 256 search operations from [4], and soft K-best
detector from [5]. In [5], a real QR decomposition is used
with a small K=5. Based on the simulation results in Fig. 8,
our solution has a better BER performance than [5] and can
achieve a higher throughput because we avoid the sorting
operation, which is very expensive in a hardware
implementation. On the other hand, at the cost of more
hardware resources, the depth-first detector in [4] has a better
BER performance than our solution. However [4] has a lim-
ited throughput because of the large number of sequential
searching operations and it has variable throughput at dif-
ferent SNR levels. Our architecture provides a good solution
in between the depth-first detector and the K-best detector.
Table 1: Architecture Comparison
Metric           [4]           [5]          This work
Throughput       38.8 Mbps     106 Mbps     600 Mbps
Core area        10 mm²        0.56 mm²     0.79 mm²
Gate count       1100 K        97 K         550 K
Max frequency    122.88 MHz    200 MHz      450 MHz
Technology       180 nm        130 nm       65 nm
7. CONCLUSION
We propose a new soft-output MIMO detector architec-
ture based on a greedy graph algorithm. This detector can
achieve a very high throughput of 600 Mbps at a hardware
cost of only 550 K gates. Compared with other solutions, the
proposed detector has a significant improvement in terms of
detection throughput, latency, and area while still maintain-
ing good bit error rate (BER) performance.
8. REFERENCES
[1] C. Berrou, A. Glavieux, and P. Thitimajshima. Near
Shannon limit error-correcting coding and decoding:
Turbo-Codes. In IEEE Int. Conf. Commun., pages
1064–1070, May 1993.
[2] A. Burg, M. Borgmann, M. Wenk, M. Zellweger,
W. Fichtner, and H. Bolcskei. VLSI Implementation of
MIMO Detection Using the Sphere Decoding
Algorithm. IEEE J. Solid-State Circuit, 40:1566–1577,
July 2005.
[3] R. Gallager. Low-density parity-check codes. IEEE
Tran. Info. Theory, 8:21–28, Jan. 1962.
[4] D. Garrett, L. Davis, S. ten Brink, B. Hochwald, and
G. Knagge. Silicon Complexity for Maximum
Likelihood MIMO Detection Using Spherical
Decoding. IEEE J. Solid-State Circuit, 39:1544–1552,
Sept. 2004.
[5] Z. Guo and P. Nilsson. Algorithm and implementation
of the K-best sphere decoding for MIMO detection.
IEEE J. Selected Areas in Commun., 24:491–503,
Mar. 2006.
[6] B. Hochwald and S. ten Brink. Achieving Near-Capacity
on a Multiple-Antenna Channel. IEEE Tran.
Commun., 51:389–399, Mar. 2003.
[7] Q. Li and Z. Wang. Improved K-best sphere decoding
algorithms for MIMO systems. In IEEE Int. Symp. on
Circuits and Syst., pages 2190–2194, Nov. 2006.
[8] Y. Sun and J. R. Cavallaro. A low-power 1-Gbps
reconfigurable LDPC decoder design for multiple 4G
wireless standards. In IEEE International SOC
Conference, pages 367–370, Sept. 2008.
[9] Y. Sun and J. R. Cavallaro. A New MIMO Detector
Architecture Based on a Forward-Backward Trellis
Algorithm. To appear in IEEE Asilomar Conference
on Signals, Systems and Computers, Oct. 2008.
[10] K. Wong, C. Tsui, R. Cheng, and W. Mow. A VLSI
architecture of a K-best lattice decoding algorithm for
MIMO channels. In IEEE Int. Symp. on Circuits and
Syst., volume 3, pages 273–276, May 2002.
