A Scalable VLSI Architecture for Soft-Input Soft-Output Depth-First
  Sphere Decoding by Witte, Ernst Martin et al.
arXiv:0910.3427v4 [cs.AR] 7 Jun 2010
Accepted for IEEE Transactions on Circuits and Systems II Express Briefs, May 2010. This draft will not be updated any more. Please refer to IEEE Xplore for the final version.
∗) The final publication will appear with the modified title “A Scalable VLSI Architecture for Soft-Input Soft-Output Single Tree-Search Sphere Decoding”.
A Scalable VLSI Architecture for Soft-Input
Soft-Output Depth-First Sphere Decoding∗
Ernst Martin Witte, Filippo Borlenghi, Gerd Ascheid, Senior IEEE, Rainer Leupers, Heinrich Meyr, Fellow IEEE
Abstract—Multiple-input multiple-output (MIMO) wireless
transmission imposes huge challenges on the design of efficient
hardware architectures for iterative receivers. A major challenge
is soft-input soft-output (SISO) MIMO demapping, often approached
by sphere decoding (SD). In this paper, we introduce what is, to the
best of our knowledge, the first VLSI architecture for SISO SD
applying a single tree-search approach. Compared with a
soft-output-only base architecture similar to the one proposed by
Studer et al. in IEEE J-SAC 2008, the architectural modifications
for soft input still allow a one-node-per-cycle execution. For a 4×4
16-QAM system, the area increases by 57 % and the operating
frequency degrades by only 34 %.
Index Terms—VLSI architecture, Schnorr-Euchner (SE) enu-
meration, iterative multiple-input multiple-output (MIMO) de-
coding, soft-input soft-output (SISO) sphere decoding (SD)
I. INTRODUCTION
Multiple-input multiple-output (MIMO) wireless transmis-
sions utilizing spatial multiplexing achieve an increased spec-
tral efficiency compared with single-antenna systems. This im-
provement comes at the cost of an increased signal-demapping
complexity, which becomes particularly critical for iterative
receivers [1]. Recent developments of soft-input soft-output
(SISO) MIMO-demapping algorithms reduced this complexity
significantly. Prominent demapping algorithms are k-best and
list-based approaches [2], [3], Markov chain Monte Carlo
algorithms (MCMC) [4] and single tree-search (STS) sphere
decoders (SD) [5]. The STS approach is often preferred since it
guarantees max-log maximum a posteriori (MAP) optimality.
Efficient VLSI implementations have been proposed for
soft-output-only STS SDs [6], [7] exploiting geometric properties
of QAM constellations. These geometric relations help to
determine a search order, referred to as enumeration, that leads
to a fast average tree-search convergence. The SISO STS
complexity has been prohibitive for VLSI implementations so
far, because geometric relations are not applicable directly.
Recent improvements of soft-input enumeration strategies moved
SISO STS SD closer to VLSI architectures [8].
Contributions: In this paper, we introduce what is, to the best
of our knowledge, the first VLSI architecture for SISO STS SD. It is
based on a soft-output-only architecture following the
one-node-per-cycle (ONPC) paradigm used by [6]. The SISO
modifications are modular enough to be applied to other
Manuscript received October 26, 2009; revised April 5, 2010. This work
has been supported by the UMIC (Ultra High-Speed Mobile Information and
Communication) Research Centre at the RWTH-Aachen University.
The authors are with the Institute for Integrated Signal Processing
Systems, RWTH-Aachen University, D-52056 Aachen, Germany (email:
{witte,borlenghi,ascheid,leupers,meyr}@iss.rwth-aachen.de).
existing STS SD architectures and still allow ONPC execu-
tion. Compared with a soft-output-only architecture, the area
increases by 57 % and the clock frequency degrades by 34 %
for a 4 × 4 16-QAM system. Thus, this architecture enables
STS-based iterative wireless MIMO receivers.
The paper is organized as follows: Section II sums up the
basics of SISO STS SD, extended by the soft-input enumer-
ation strategy in Section III. Section IV describes important
implementation aspects of the scalable VLSI architecture. In
Section V the parameter design space of the SISO STS archi-
tecture as well as area, timing and throughput are discussed.
II. SINGLE TREE-SEARCH SOFT-INPUT SPHERE
DECODING
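A spatial-multiplexing MIMO scheme with $M_T$ transmit
and $M_R \ge M_T$ receive antennas is assumed [1]. Each
transmit antenna sends one of the $2^Q$ complex elements of
the symbol set $\mathcal{O}$ defined by the modulation alphabet, which
is assumed to be the same for every antenna. Each vector
$\mathbf{s} = [s_1, \ldots, s_{M_T}]^T \in \mathcal{O}^{M_T}$ results from mapping $M_T Q$ bits
$x_{i,b} \in \{+1,-1\}$ to an element of $\mathcal{O}^{M_T}$, with $i$ being the
antenna index and $b$ the bit index for one scalar symbol $s_i$.
The received symbol vector $\mathbf{y} \in \mathbb{C}^{M_R}$ is given by
$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}$, where $\mathbf{H} \in \mathbb{C}^{M_R \times M_T}$ is the channel matrix
and $\mathbf{n} \in \mathbb{C}^{M_R}$ is a white circular Gaussian noise vector with
variance $N_0$ per element. For tree-search SD, $\mathbf{H}$ is typically
QR-decomposed (QRD) with $\mathbf{H} = \mathbf{Q}\mathbf{R}$, $\mathbf{Q} \in \mathbb{C}^{M_R \times M_T}$,
$\mathbf{Q}^H\mathbf{Q} = \mathbf{I}$ and $\mathbf{R} \in \mathbb{C}^{M_T \times M_T}$ being an upper triangular matrix
[1], [5]. With $\tilde{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$ and $\tilde{\mathbf{n}} = \mathbf{Q}^H\mathbf{n}$, this results in
$$ \tilde{\mathbf{y}} = \mathbf{R}\mathbf{s} + \tilde{\mathbf{n}} . \tag{1} $$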
According to [5], the triangular matrix $\mathbf{R}$ in equation (1)
allows the SISO max-log MAP MIMO detection problem to be
formulated as an STS within a complete $2^Q$-ary tree. The tree levels
correspond to the $M_T$ antennas; each node $s_i \in \mathcal{O}$ on tree level
$i$ is a received symbol candidate, with $s_1$ being a leaf node. An
exhaustive search in such a tree leads to a worst-case run-time
complexity of $O(2^{Q M_T})$. As formalized in equations (2) to (4),
metric increments $M_C(s_i)$ for channel-based and $M_A(s_i)$ for
a priori-based information are summed up to a total increment
$M_P(s_i)$. $P[s_i]$ is the symbol probability computed from the a
priori log-likelihood ratios (LLRs) $L^A_{i,b}$.
$$ M_A(s_i) = -\log P[s_i] \tag{2} $$
$$ M_C(s_i) = \frac{1}{N_0} \left| \tilde{y}_i - \sum_{j=i}^{M_T} R_{i,j} s_j \right|^2 \tag{3} $$
$$ M_P(s_i) = M_C(s_i) + M_A(s_i) \tag{4} $$
April 7, 2010 1 DRAFT
The sum of metric increments along a path from the root to
node $s_i$ yields the partial metric $M_P(\mathbf{s}^{(i)})$ for a partial symbol
vector $\mathbf{s}^{(i)} = [s_i, \ldots, s_{M_T}]^T$:
$$ M_P(\mathbf{s}^{(i)}) = \sum_{j=i}^{M_T} M_P(s_j) \tag{5} $$
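As a minimal sketch of equations (3) to (5), the following Python snippet accumulates the metric increments along a partial path. It assumes 0-based level indices, a list-of-lists representation of R, and a hypothetical list `m_a` holding the a priori increment M_A(s_j) of the symbol chosen on each level; these names are illustrative and not part of the architecture.

```python
def channel_increment(i, s, y_tilde, r, n0):
    """Equation (3) for level i (0-based); only s[i:] is read."""
    # Residual of row i of the triangular system for the chosen symbols.
    residual = y_tilde[i] - sum(r[i][j] * s[j] for j in range(i, len(s)))
    return abs(residual) ** 2 / n0


def partial_metric(i, s, y_tilde, r, n0, m_a):
    """Equation (5): sum of M_P(s_j) = M_C(s_j) + M_A(s_j) over levels j = i..MT-1."""
    return sum(channel_increment(j, s, y_tilde, r, n0) + m_a[j]
               for j in range(i, len(s)))


# Hypothetical 2x2 example with real-valued entries for readability.
r = [[1.0, 0.5],
     [0.0, 2.0]]
s = [1.0, -1.0]
y_tilde = [1.5, -2.0]
m_a = [0.25, 0.5]
print(partial_metric(1, s, y_tilde, r, 0.5, m_a))  # partial vector starting at the top level
print(partial_metric(0, s, y_tilde, r, 0.5, m_a))  # complete symbol vector
```

Because R is upper triangular, the increment on level i depends only on symbols of levels i and above, which is what makes the level-by-level tree search possible.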
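During an STS, the MAP solution $\mathbf{s}^{MAP}$, its bits $x^{MAP}_{i,b}$, its metric
$\lambda^{MAP} = M_P(\mathbf{s}^{MAP})$ and the extrinsic counter-hypothesis metrics
$\Lambda^{MAP}_{i,b}$ are computed by successively improving the current
metrics $\lambda^{MAP,cur}$ and $\Lambda^{MAP,cur}_{i,b}$. $L^E_{i,b}$ are extrinsic LLRs with
$$ \mathbf{s}^{MAP} = \arg\min_{\mathbf{s} \in \mathcal{O}^{M_T}} \{ M_P(\mathbf{s}) \} $$
$$ \Lambda^{MAP}_{i,b} = \min_{\mathbf{s} \in \mathcal{O}^{M_T} \wedge x_{i,b} \ne x^{MAP}_{i,b}} \{ M_P(\mathbf{s}) \} - L^A_{i,b} x^{MAP}_{i,b} $$
$$ L^E_{i,b} = \left( \Lambda^{MAP}_{i,b} - \lambda^{MAP} \right) x^{MAP}_{i,b} . $$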
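The three definitions above can be checked on a toy example by exhaustive search. The sketch below is not a sphere decoder: it assumes that the total metrics M_P(s) of all hypotheses are already given in a hypothetical dictionary keyed by the bit vector, and it flattens the index pair (i, b) into a single bit index k.

```python
def extrinsic_llrs(metrics, l_a):
    """Brute-force evaluation of s^MAP, lambda^MAP and L^E over all hypotheses.

    metrics: dict mapping a tuple of bits (+1/-1) to M_P(s) of that hypothesis
    l_a:     a priori LLRs, one per bit, in the same flattened bit order
    """
    x_map = min(metrics, key=metrics.get)      # argmin of M_P(s)
    lam_map = metrics[x_map]
    l_e = []
    for k, x_k in enumerate(x_map):
        # Best hypothesis whose bit k differs from the MAP bit (counter-hypothesis).
        counter = min(m for x, m in metrics.items() if x[k] != x_k)
        big_lambda = counter - l_a[k] * x_k
        l_e.append((big_lambda - lam_map) * x_k)
    return x_map, lam_map, l_e


# Hypothetical metrics for a two-bit system.
metrics = {(1, 1): 1.0, (1, -1): 3.0, (-1, 1): 2.5, (-1, -1): 6.0}
x_map, lam_map, l_e = extrinsic_llrs(metrics, l_a=[0.5, -1.0])
print(x_map, lam_map, l_e)
```

The tree search described in the text computes the same quantities while visiting only a fraction of the hypotheses.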
These metric computations dominate the detection complexity.
For a depth-first tree search, the pruning of sub-trees lying
outside a hypersphere with a radius not improving
$\lambda^{MAP,cur}_{i,b} = \Lambda^{MAP,cur}_{i,b} + L^A_{i,b} x^{MAP}_{i,b}$ provides a heuristic for complexity
reduction which is sensitive to the visiting order $[s_i^{(1)}, \ldots, s_i^{(|\mathcal{O}|)}]$.
A Schnorr-Euchner (SE) order [9] provides a very fast search
convergence by the following pruning criteria [6], typically
defining the pruning metrics $M^{down}_{prn,j} := M^{sibl.}_{prn,j} := M_P(\mathbf{s}^{(j)})$:
$$ M^{down}_{prn,j} \ge \max \left\{ \lambda^{MAP,cur}_{i,b} \;\middle|\; i < j \vee x_{i,b} \ne x^{MAP,cur}_{i,b}, \forall b \right\} \tag{6} $$
$$ M^{sibl.}_{prn,j} \ge \max \left\{ \lambda^{MAP,cur}_{i,b} \;\middle|\; i \le j \vee x_{i,b} \ne x^{MAP,cur}_{i,b}, \forall b \right\} \tag{7} $$
If inequality (6) holds, the current node and its sub-tree
are pruned, otherwise a step down is performed in the tree.
If inequality (7) holds, the enumeration on level j stops,
otherwise the sibling of the current node is enumerated. The
arguments of the max operators in (6) and (7) are the sets A
and B respectively in [6]. We define an examined node (as used
in [6] and [7]) as a node sj that has been checked against at
least one pruning criterion, leading to the complexity measure
number of examined nodes per detected symbol vector Nen.
If a leaf node with $M_P(\mathbf{s}) \ge \lambda^{MAP,cur}$ is not pruned by
inequalities (6) or (7), the values $\{\Lambda^{MAP,cur}_{i,b} \mid x_{i,b} \ne x^{MAP,cur}_{i,b}\}$
need to be updated by $\min\{\Lambda^{MAP,cur}_{i,b}, M_P(\mathbf{s}) - L^A_{i,b} x^{MAP,cur}_{i,b}\}$.
Otherwise, if $M_P(\mathbf{s}) < \lambda^{MAP,cur}$, the current leaf becomes
the new MAP solution and the extrinsic counter-hypothesis
metrics $\{\Lambda^{MAP,cur}_{i,b} \mid x^{MAP,old}_{i,b} \ne x^{MAP,cur}_{i,b}\}$ are updated by
$\min\{\Lambda^{MAP,cur}_{i,b}, \lambda^{MAP,old} - L^A_{i,b} x^{MAP,cur}_{i,b}\}$.
Many methods exist to reduce $N_{en}$, like sorted QRD
(SQRD) [10] and extrinsic LLR clipping [5]. The latter
limits the allowed range for $L^E_{i,b}$ to $|L^E_{i,b,clipped}| \le L^E_{max}$, which
leads to clipped extrinsic metrics $\Lambda^{MAP}_{i,b,clipped}$:
$$ \Lambda^{MAP}_{i,b,clipped} = \max \left\{ \lambda^{MAP} - L^E_{max}, \min \left\{ \lambda^{MAP} + L^E_{max}, \Lambda^{MAP}_{i,b} \right\} \right\} \tag{8} $$
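Equation (8) is a two-sided saturation of the counter-hypothesis metric around the MAP metric. A minimal sketch, with hypothetical numeric values and function name chosen only for illustration:

```python
def clip_counter_metric(big_lambda, lam_map, l_e_max):
    """Equation (8): confine Lambda to the interval [lambda - L_max, lambda + L_max]."""
    return max(lam_map - l_e_max, min(lam_map + l_e_max, big_lambda))


for value in (5.0, 2.0, -4.0):
    clipped = clip_counter_metric(value, lam_map=1.0, l_e_max=2.0)
    # |clipped - lambda| is the magnitude of the resulting extrinsic LLR.
    print(value, "->", clipped, "| |L^E| =", abs(clipped - 1.0))
```

With both bounds applied, the magnitude of the extrinsic LLR derived from the clipped metric never exceeds the clipping level.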
[Figure 1 (graphic omitted): for each enumeration step k = 1, ..., 4, the a priori-based candidate $s^{(k)}_{A,i}$ and the channel-based candidate $s^{(k)}_{C,i}$ are listed with a comparison of their metrics $M_P$ and the selected node $s^{(k)}_i$; candidates that were already enumerated are marked as skipped.]
Fig. 1. Hybrid-enumeration example, kth symbol in SE order: $\mathcal{O}^{(k)} \in \mathcal{O}$.
Please note that equation (8) is stricter than the min{} function
used in [5], where a post-processing step is used to guarantee
$|L^E_{i,b,clipped}| \le L^E_{max}$ for proper channel decoding. In [5], this
saves 50 % of the comparisons required for clipping. Experiments
indicate that $E[N_{en}]$ differs only marginally between the
two clipping methods. Moreover, radius tightening further
reduces $N_{en}$. A hardware-friendly approximation of $M_A(s_i)$ for
statistically independent symbols, including tightening and still
guaranteeing max-log-optimal a posteriori LLRs, has been
proposed in [5] (with unipolar bits $d_{i,b} = \frac{1}{2}(1 - x_{i,b} \cdot \mathrm{sign}(L^A_{i,b}))$):
$$ M_A(s_i) = -\log P[s_i] \approx \sum_{b=1}^{Q} \begin{cases} |L^A_{i,b}|, & d_{i,b} = 1 \\ 0, & \text{otherwise} \end{cases} \tag{9} $$
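The approximation (9) only adds the LLR magnitudes of those bits that contradict the sign of their a priori LLR. A minimal sketch follows; the treatment of a zero LLR as positive is an assumption made here, since the text does not specify it.

```python
def a_priori_metric(x, l_a):
    """Approximation (9) of M_A(s_i) from bipolar bits x (+1/-1) and a priori LLRs l_a."""
    total = 0.0
    for x_b, l_b in zip(x, l_a):
        sign = 1 if l_b >= 0 else -1      # assumption: sign(0) is taken as +1
        d_b = (1 - x_b * sign) // 2       # unipolar bit: 1 if x_b contradicts the LLR sign
        if d_b == 1:
            total += abs(l_b)
    return total


l_a = [2.0, -1.5, -0.5, 3.0]
print(a_priori_metric([1, -1, -1, 1], l_a))   # all bits agree with the LLR signs
print(a_priori_metric([1, -1, 1, -1], l_a))   # the last two bits disagree
```

The symbol whose bits all agree with the LLR signs always receives a zero a priori increment, a property used later for the first a priori enumeration step.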
III. THE HYBRID-ENUMERATION ALGORITHM
A major issue of SD algorithms is the enumeration process,
namely the determination of the SE order $[s_i^{(1)}, \ldots, s_i^{(|\mathcal{O}|)}]$ on
a level $i$, with $s_i^{(k)}$ representing the $k$th candidate for node $s_i$
in ascending order of $M_P$. A straightforward implementation
by computing and fully sorting the set $\{M_P(s_i^{(k)})\}$ is very
expensive and inefficient. For the soft-output-only case, the
geometric properties of the QAM constellation can be exploited
to avoid full sorting and thus save most of the computations, as
proposed in [6], [7], [11]. However, in iterative receivers these
optimizations are not usable directly because the geometry-based
order is scrambled by the a priori information. A viable
approach towards efficient soft-input enumeration is given by
the hybrid-enumeration algorithm presented in [8]. Its basic
idea is to split the enumeration of $\{M_P(s_i^{(k)})\}$ into two
concurrent enumerations of $\{M_C(s_i^{(k)})\}$ and $\{M_A(s_i^{(k)})\}$.
On the one hand, the enumeration of $\{M_C(s_i^{(k)})\}$ is the
same as in the soft-output-only case, so that any of the
aforementioned efficient methods can be reused, even in later
iterations. On the other hand, the enumeration of $\{M_A(s_i^{(k)})\}$
is efficient as well since the linear sorting of the symbol set $\mathcal{O}$
needs to be performed independently only once per antenna.
According to [8], the channel- and a priori-based enumerations
independently select candidate symbols $s_{C,i}^{(k)}$ and $s_{A,i}^{(k)}$
at each step $k$. The hybrid enumeration simply selects the
candidate with the lower metric $M_P$ between these two.
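The selection rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the algorithm of [8] in detail: both orders are obtained here with `sorted()` as a stand-in for the geometry-based channel enumeration, ties are resolved in favour of the channel-based candidate, and the metric values are hypothetical.

```python
def hybrid_enumeration(m_c, m_a):
    """Return the symbols of one tree level in hybrid-enumeration order.

    m_c, m_a: dicts mapping each symbol to its channel-based and a priori-based increment.
    """
    order_c = sorted(m_c, key=m_c.get)    # stand-in for the geometry-based enumeration
    order_a = sorted(m_a, key=m_a.get)    # a priori order, sorted once per antenna
    enumerated = set()                    # symbols that have already been selected
    result = []
    idx_c = idx_a = 0
    while len(result) < len(m_c):
        # Both enumerations skip symbols that have already been selected.
        while order_c[idx_c] in enumerated:
            idx_c += 1
        while order_a[idx_a] in enumerated:
            idx_a += 1
        cand_c, cand_a = order_c[idx_c], order_a[idx_a]
        # Select the candidate with the lower total increment M_P = M_C + M_A.
        if m_c[cand_c] + m_a[cand_c] <= m_c[cand_a] + m_a[cand_a]:
            pick = cand_c
        else:
            pick = cand_a
        enumerated.add(pick)
        result.append(pick)
    return result


# Hypothetical increments; the names O1..O4 follow the true ascending order of M_P.
m_c = {"O1": 1.0, "O2": 3.0, "O3": 0.0, "O4": 4.0}
m_a = {"O1": 1.0, "O2": 0.0, "O3": 5.0, "O4": 2.0}
print(hybrid_enumeration(m_c, m_a))
```

In this example the second-best symbol is enumerated before the best one, so the resulting order is close to, but not identical with, the SE order.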
As visualized in Figure 1, the strict SE order is not preserved,
hence the inequality $M_P(s_i^{(k)}) \le M_P(s_i^{(l)}), \forall l > k$
does not hold any more. Thus, a modification of the pruning
criteria is needed to avoid the erroneous exclusion of the MAP
or counter-hypothesis solutions. For $l > k$, the inequalities
$M_C(s_{C,i}^{(k)}) \le M_C(s_{C,i}^{(l)})$ and $M_A(s_{A,i}^{(k)}) \le M_A(s_{A,i}^{(l)})$ lead to
$M_C(s_{C,i}^{(k)}) + M_A(s_{A,i}^{(k)}) \le M_P(s_i^{(l)})$, providing an alternative
lower bound for tree pruning. Thus, in [8] the pruning metric
of inequality (7) on the current tree level $i$ is re-defined as
$$ M^{sibl.}_{prn,i} := M_C(s_{C,i}^{(k)}) + M_A(s_{A,i}^{(k)}) + M_P(\mathbf{s}^{(i+1)}) . \tag{10} $$
Compared with the SE order, pruning metric (10) preserves
the error-rate performance at the price of a slight increase
in Nen. For a more detailed description and analysis of the
hybrid-enumeration algorithm, the reader is referred to [8].
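The effect of pruning metric (10) can be illustrated numerically. The sketch below compares it with the partial metric of the true next-best sibling for a set of hypothetical increments; the function names and the parent metric value are assumptions made for this example.

```python
def sibling_pruning_metric(m_c, m_a, enumerated, parent_metric):
    """Pruning metric (10) computed over the not-yet-enumerated symbols of one level."""
    remaining = [s for s in m_c if s not in enumerated]
    best_c = min(m_c[s] for s in remaining)   # M_C of the next channel-based candidate
    best_a = min(m_a[s] for s in remaining)   # M_A of the next a priori-based candidate
    return best_c + best_a + parent_metric


def exact_sibling_metric(m_c, m_a, enumerated, parent_metric):
    """Partial metric of the true next-best sibling, as a strict SE order would provide."""
    return min(m_c[s] + m_a[s] for s in m_c if s not in enumerated) + parent_metric


m_c = {"O1": 1.0, "O2": 3.0, "O3": 0.0, "O4": 4.0}
m_a = {"O1": 1.0, "O2": 0.0, "O3": 5.0, "O4": 2.0}
for done in (set(), {"O2"}, {"O2", "O1"}):
    print(sibling_pruning_metric(m_c, m_a, done, 10.0),
          exact_sibling_metric(m_c, m_a, done, 10.0))
```

The bound never exceeds the exact metric, so no valid node is pruned by mistake; because it can be smaller, some siblings are examined that a strict SE order would have pruned, which corresponds to the slight increase in the number of examined nodes.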
IV. A VLSI ARCHITECTURE FOR STS SOFT-INPUT
SPHERE DECODING
In this section, a VLSI architecture for SISO STS SD is
introduced. It is derived from a soft-output-only depth-first
STS base architecture extended by soft-input processing. The
main challenges arising from the implementation of efficient
soft-input extensions according to the hybrid-enumeration
scheme are discussed. Further algorithmic optimizations such
as LLR correction proposed in [5] are orthogonal to the base
architecture and can be implemented on top of it.
A. Soft-Output-Only Base Architecture
The soft-output-only base STS architecture, composed of
the light gray blocks in Figure 2, follows the ONPC execution
principle used by Studer et al. in [6]. Its architectural
structure is derived from the observation that the tree search
is composed of three basic control-flow steps:
i) Vertical steps (①) down from tree level $i$ to $i-1$ enumerate
the first child node $s_{i-1}^{(1)}$ of a parent node $s_i^{(k)}$. This requires
a quantization step $\mathcal{Q}$ to find the QAM symbol nearest to $\tilde{y}_i$,
followed by the computation of $M_P(s_{i-1}^{(1)})$. The result of $\mathcal{Q}$ is
used to initialize the enumeration on the tree level $i-1$ and
by the pruning-criteria check for $s_{i-1}^{(1)}$.
ii) Horizontal steps (②) on a tree level $i$ enumerate the node
$s_i^{(k+1)}$ after enumerating the node $s_i^{(k)}$ and its sub-tree. This
category also includes steps back from a child node $s_{i-1}$ to
the next sibling $s_i^{(k+1)}$ of its parent node $s_i^{(k)}$.
iii) Pruning-criteria checks (③) for a node $s_i^{(k)}$ determine if
either a vertical step to the child $s_{i-1}^{(1)}$, a horizontal step to the
sibling $s_i^{(k+1)}$ or a horizontal step to its parent's sibling $s_{i+1}^{(l+1)}$
has to be performed next.
The $M_P$ history (④) unit stores the
partial metrics $M_P(\mathbf{s}^{(i)})$, recursively implements equation (5)
and provides its result to unit ③ for pruning and LLR clipping
by equation (8).
In a depth-first SD, the tree-traversal control flow exhibits
severe data and control dependencies. In order to achieve a
throughput of one examined node per cycle, the base architecture
executes the pruning check for node $s_i^{(k)}$ concurrently
with the steps towards $s_{i-1}^{(1)}$ and $s_i^{(k+1)}$ in cycle $n$. If the
pruning check selects $s_{i-1}^{(1)}$, $s_i^{(k+1)}$ is saved in a
preferred-siblings cache (⑤) for later use during a step up in the tree.
Thus, in cycle $n+1$ the availability of a valid node for the
next pruning check is guaranteed.
The enumeration unit of the base architecture employs the
column-wise zig-zag enumeration strategy (⑥) presented in
[11]. Compared with circular PSK-like enumeration [6], the
column-wise enumeration allows a much more regular hardware
implementation. Furthermore, for 64 QAM and higher
modulation orders it requires fewer comparisons.
[Figure 2 (block diagram, graphic omitted): STS control FSM with inputs $\mathbf{R}$, $\tilde{\mathbf{y}}$, $\{L^A_{i,b}\}$; interference cancellation $\tilde{y}_i - \sum_{j=i+1}^{M_T} R_{i,j} s_j$; units ① to ⑨ including the enumerated-nodes flags ($M_T \cdot 2^Q$ bits), column-wise zig-zag enumeration, $M_C$ and $M_A$ computation, the $2^Q$:1 minimum search for $M_{A,min}$ over $\{M_A\}_i$, the preferred-siblings cache ($M_T - 1$ entries), the $M_P$ history, and the pruning-criteria unit holding $\{\Lambda^{MAP}_{i,b}\}$, $\lambda^{MAP}$, $\{x^{MAP}_{i,b}\}$ and producing $\{L^E_{i,b}\}$.]
Fig. 2. Block diagram of the proposed soft-input STS SD VLSI architecture.
Units added/modified for soft-input are emphasized by dark gray background.
Legend: Mapper M, Demapper D, Quantizer Q.
Since there is no assumption on the mapping between QAM
symbols and bits, two run-time-programmable lookup tables,
named mapper M and demapper D respectively, are used for
the conversion between the symbol and the bit representations.
B. Soft-Input Extensions
In order to extend the base architecture presented in Sec-
tion IV-A, mainly extra units for the a priori-based enumera-
tion have to be added, along with slight changes in the column-
wise zig-zag implementation. These extensions correspond to
the dark gray units in Figure 2.
1) Enumerated-nodes flags: Both channel- and a priori-
based enumeration units have to skip nodes that have already
been enumerated, because the local enumeration orders for
MC and MA differ from the global enumeration order.
Therefore, both units need the list of enumerated nodes to
guarantee that each node is enumerated only once. This flag
vector of $2^Q$ bits per antenna is maintained in unit ⑦.
2) Modified column-wise zig-zag enumeration: Skipping
an arbitrary number of nodes implies modifications to the
column-wise zig-zag implementation (⑥). Compared with the
base architecture, the new column-enumeration unit does not
keep internal zig-zag states any more. Instead, each column
enumeration performs a minimum search over the linear
distances between the quantized imaginary part
$\mathcal{Q}(\mathrm{Im}\{\tilde{y}_i - \sum_{j=i+1}^{M_T} R_{i,j} s_j\})$ and all rows $\{\mathrm{Im}\{s_i\} \mid s_i \in \mathcal{O}\}$, masked by the
enumerated-nodes flags. The hardware complexity increases
only moderately, because distance computations are the same
for all columns and operate on words of only $Q/2 + 1$ bits.
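A software analogue of this masked minimum search for a single column is sketched below. Rows are represented by their integer imaginary coordinates, and the tie-break towards the lower row index is an assumption; the text does not state how equal distances are resolved in hardware.

```python
def next_row(quantized_im, rows, enumerated):
    """Index of the nearest not-yet-enumerated row of one column, or None if exhausted."""
    best_idx, best_dist = None, None
    for idx, row in enumerate(rows):
        if enumerated[idx]:
            continue                       # masked by the enumerated-nodes flag
        dist = abs(row - quantized_im)     # linear distance along the imaginary axis
        if best_dist is None or dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


rows = [-3, -1, 1, 3]                      # imaginary parts of the rows of 16 QAM
flags = [False] * 4
order = []
while (idx := next_row(1, rows, flags)) is not None:
    flags[idx] = True
    order.append(rows[idx])
print(order)
```

When no node is skipped, repeated calls reproduce a zig-zag order around the quantized point; when flags have been set by the a priori-based enumeration, the search simply returns the nearest remaining row without any stored zig-zag state.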
3) A priori-based enumeration: With $d_i$ being the decimal
representation of the bit vector $[d_{i,Q}, \ldots, d_{i,1}]$, a mapping of $d_i$
to the corresponding symbol $s_i(d_i)$, $M_A(d_i) = M_A(s_i(d_i))$
and an order defined by $s_i(d_i^{(k)}) = s_{A,i}^{(k)}$, one problem of
enumerating $\{M_A\}_i = \{M_A(d_i) \mid 0 \le d_i < 2^Q\}$ is the lack of
relations among a priori LLRs. Thus, the only known solution
is the full computation and sorting of $\{M_A\}_i$.
First, the computation of $\{M_A\}_i$ (⑧) requires $2^Q - Q - 1$
additions per antenna and received vector. Due to the ONPC
principle and the structure of (9), the number of hardware
adders can be reduced by resource sharing. The first enumeration
step always results in $d_i^{(1)} = 0$ and $M_A(d_i^{(1)}) = 0$,
thus the subset $\{M_A\}_{i,L} = \{M_A(d_i) \mid 1 \le d_i \le 2^{Q-1}\}$ can
be computed concurrently. In the second step, $M_A(d_i^{(2)}) = \min_{\forall b} |L^A_{i,b}|$
can be enumerated since $M_A(d_i^{(2)}) \in \{M_A\}_{i,L}$,
while the subset $\{M_A\}_{i,H} = \{M_A(d_i) \mid 2^{Q-1} < d_i < 2^Q\}$ can
be computed. This approach only requires $2^{Q-1} - 1$ adders
independently of $M_T$, yielding adder savings of 36 % for
16 QAM and 45 % for 64 QAM. Furthermore, for an ONPC
architecture, no latency is added since the subsets $\{M_A\}_{i,L}$
and $\{M_A\}_{i,H}$ can be computed during the enumeration of $s_{A,i}^{(1)}$
and $s_{A,i}^{(2)}$. Further resource sharing would result in limited gains
while significantly increasing irregularity.
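The addition counts can be reproduced with a short sketch. It builds the table of $M_A(d_i)$ values incrementally, so that every entry with more than one set bit costs exactly one addition, and it evaluates the two adder counts given in the text. The list `abs_llrs` is a hypothetical input in which index 0 corresponds to bit $b = 1$.

```python
def a_priori_table(abs_llrs):
    """Build {M_A}_i indexed by d_i and count the additions that were needed."""
    q = len(abs_llrs)
    table = [0.0] * (1 << q)               # table[0] = 0 is the first enumerated entry
    additions = 0
    for d in range(1, 1 << q):
        low = d & -d                       # lowest set bit of d
        rest = d ^ low                     # d with that bit cleared (already computed)
        b = low.bit_length() - 1
        if rest == 0:
            table[d] = abs_llrs[b]         # single-bit entry: |L^A| itself, no adder
        else:
            table[d] = table[rest] + abs_llrs[b]
            additions += 1
    return table, additions


def adder_counts(q):
    """Additions per antenna without sharing, and adders after the split into two subsets."""
    return 2 ** q - q - 1, 2 ** (q - 1) - 1


table, additions = a_priori_table([1.0, 2.0, 4.0, 8.0])
print(additions, adder_counts(4), adder_counts(6))
for q in (4, 6):
    full, shared = adder_counts(q)
    print(q, f"{1 - shared / full:.1%}")   # relative adder saving
```

The resulting ratios are in line with the savings quoted above for 16 QAM and 64 QAM.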
The second issue is sorting {MA}i. Since latency is typi-
cally a serious issue for run-time constrained depth-first SD,
an approach has been chosen that does not add latency for the
sorting of {MA}i. The ONPC principle allows a minimum
search (⑨) for MA,min over the set {MA}i for the enumeration
of the current antenna i, masked by the enumerated-nodes
flags. The resulting binary tree of compare-select (CS) units
would dominate the critical path already for 16 QAM.
However, the properties of equation (9) can be exploited
to remove almost all comparators and CS dependencies
for the first three CS levels. The principle can be
explained easily by considering the removal of the first
level: for pairs $\{M_A(s_i^{(k)}), M_A(s_i^{(l)})\}$ with only one differing bit
$\{b \mid x_{i,b}^{(k)} \ne x_{i,b}^{(l)}\}$, the larger metric $M_A(s_i^{(\{k,l\})})$ is the one
with $x_{i,b}^{(\{k,l\})} \ne \mathrm{sign}(L^A_{i,b})$. This kind of decision does not
need any metric comparison but can be determined by single-bit
comparisons of sign bits and enumerated-nodes flags.
Selecting the minimum of 4-tuples (first two CS tree levels)
differing in only two bits $\{b_{\{m,n\}} \mid x_{i,b_{\{m,n\}}}^{(k)} \ne x_{i,b_{\{m,n\}}}^{(l)}\}$
requires an additional comparison $|L^A_{i,b_m}| \gtrless |L^A_{i,b_n}|$. However,
this extra comparison is the same for all 4-tuple sub-trees and
does not depend on intermediate results generated in the CS
tree. Therefore, the critical path is significantly reduced. The
extension to 8-tuples (first three CS tree levels) has a total of
only six parallel comparators. Thus, only one CS unit and two
8:1 multiplexers are required for 16 QAM and only seven CS
units and eight 8:1 multiplexers for 64 QAM. Compared with a
full CS tree, the comparator savings are 53 % in total and 50 %
in the critical path for 16 QAM and 79 % in total and 33 %
in the critical path for 64 QAM. Extensions to higher orders
than 8-tuples are possible but would result in an exponential
complexity increase.
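The removal of the first CS level can be mimicked in software. The sketch below works directly on the unipolar index $d_i$, which already encodes the comparison of each bit with the sign of its LLR; that encoding, the function names and the flag pattern are assumptions made for the illustration. The first function decides without looking at any metric, the second one is an explicit compare-select used as reference.

```python
def pair_winner(d0, bit, enumerated):
    """First CS level without metric comparison.

    d0 has the given bit cleared; its partner d1 has it set, so
    M_A(d1) = M_A(d0) + |L^A| >= M_A(d0) by equation (9).
    Returns the smaller not-yet-enumerated index, or None if both are enumerated.
    """
    d1 = d0 | (1 << bit)
    if not enumerated[d0]:
        return d0
    if not enumerated[d1]:
        return d1
    return None


def pair_winner_by_comparison(d0, bit, enumerated, table):
    """Reference: the same decision taken by an explicit compare-select on the metrics."""
    candidates = [d for d in (d0, d0 | (1 << bit)) if not enumerated[d]]
    return min(candidates, key=lambda d: table[d]) if candidates else None


abs_llrs = [0.7, 2.0, 0.3, 1.1]            # hypothetical |L^A| values for Q = 4
table = [sum(l for b, l in enumerate(abs_llrs) if (d >> b) & 1) for d in range(16)]
enumerated = [d % 3 == 0 for d in range(16)]   # arbitrary flag pattern for the demonstration
agree = all(
    pair_winner(d0, bit, enumerated) == pair_winner_by_comparison(d0, bit, enumerated, table)
    for bit in range(4) for d0 in range(16) if not (d0 >> bit) & 1
)
print(agree)
```

Both functions return the same winner for every pair, which is why the comparators of the first level can be replaced by logic on single bits.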
4) Pruning-criteria checks: In [6], the checks of the prun-
ing criteria of equations (6) and (7) have been simplified to
a single pruning-criterion check of equation (7) in order to
reduce hardware complexity, at the cost of a slight increase of
Nen. For the SISO STS SD architecture proposed in this paper,
the implementation of two different pruning criteria in unit ③
is mandatory to prevent a further significant increase of Nen.
In order to avoid extra delays on the critical path, the pruning-
criteria checks are not implemented as maximum searches but
as pairs of MT2Q fully parallel comparators Mdownprn,j > λMAPi,b
and Msibl.prn,j > λMAPi,b , followed by simple bit-masking and
combining.
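The equivalence between a maximum search and parallel comparisons followed by masking can be sketched as follows. The mask marks which $\lambda^{MAP,cur}_{i,b}$ belong to the set of inequality (6) or (7). The example uses the non-strict comparison of those inequalities, and all names and values are hypothetical.

```python
def prune_by_max_search(metric, lambdas, mask):
    """Evaluate 'metric >= max{selected lambdas}' with an explicit maximum search."""
    selected = [lam for lam, use in zip(lambdas, mask) if use]
    return metric >= max(selected, default=float("-inf"))


def prune_by_parallel_compare(metric, lambdas, mask):
    """Same decision: compare against every lambda at once, then mask and combine."""
    compare_bits = [metric >= lam for lam in lambdas]     # independent comparators
    return all(bit or not use for bit, use in zip(compare_bits, mask))


lambdas = [3.0, 5.0, 1.0, 4.0]
mask = [True, False, True, True]
for metric in (4.5, 3.5):
    print(metric,
          prune_by_max_search(metric, lambdas, mask),
          prune_by_parallel_compare(metric, lambdas, mask))
```

The parallel form has no dependency between the individual comparisons, so its delay is that of one comparator plus the combining logic rather than that of a comparator tree.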
V. ASIC SYNTHESIS RESULTS
The architecture presented in the previous section has been
implemented in VHDL including parameters for word lengths,
$M_T$, QAM order and a switch to enable/disable soft-input
support. A representative set of parameter combinations has
been instantiated by layout-aware gate-level synthesis¹.
Since both the soft-output-only base architecture and the
SISO architecture follow the ONPC principle, their throughput
$\Theta$ can be determined by
$$ \Theta = \frac{r Q M_T}{E[N_{en}]} f_{clk} \quad [\text{bit/s}] \tag{11} $$
with $r$ being the code rate and $E[N_{en}]$ being the average $N_{en}$.
The curves for the iterative $\Theta$ and the cumulative $E[N_{en}]$ for
a 4×4 16-QAM MIMO system² achieving a frame error rate
(FER) of 1 % are given in Figure 3, including as a reference
the cumulative $E[N_{en}]$ obtained by SE ordering and
floating-point operations. In the 4th iteration, the hybrid-enumeration
algorithm introduces an overhead of less than 28 % in terms of
$E[N_{en}]$. The least-effort throughput in Figure 3 is derived from
equation (11) by selecting the minimum cumulative $E[N_{en}]$
among all iterations for a specific SNR. The intersections of
the cumulative $E[N_{en}]$ curves determine the SNR points for
changing the number of iterations. In Figure 3 the switching
points are marked by ① (1 ⇄ 2 iterations), by ② (2 ⇄ 3
iterations) and by ③ (3 ⇄ 4 iterations).
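Equation (11) and the least-effort selection can be written down directly. In the sketch below, the code rate, Q, M_T and the 250 MHz clock correspond to the 4×4 16-QAM SISO configuration of this paper, whereas the cumulative node counts are hypothetical placeholders and not values read from Figure 3.

```python
def throughput_bps(code_rate, q, m_t, avg_nodes, f_clk_hz):
    """Equation (11): information bits per second for a one-node-per-cycle decoder."""
    return code_rate * q * m_t / avg_nodes * f_clk_hz


def least_effort(cumulative_nodes, code_rate, q, m_t, f_clk_hz):
    """Pick the iteration count with the minimum cumulative E[N_en] at one SNR point."""
    iterations, nodes = min(cumulative_nodes.items(), key=lambda item: item[1])
    return iterations, throughput_bps(code_rate, q, m_t, nodes, f_clk_hz)


# Hypothetical cumulative E[N_en] for 1, 2 and 4 iterations at a single SNR point.
cumulative_nodes = {1: 80.0, 2: 40.0, 4: 50.0}
print(least_effort(cumulative_nodes, code_rate=0.5, q=4, m_t=4, f_clk_hz=250e6))
```

Repeating the selection over a range of SNR points yields a least-effort curve whose iteration count changes where the cumulative curves intersect.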
Area and delay of this architecture are quite sensitive to the
fixed-point word lengths. Therefore, the word lengths have
been carefully selected to make the FER-performance loss
negligible with respect to floating-point operation³.
¹ UMC 90 nm standard-performance CMOS library, typical case, Synopsys
Design Compiler 2009.06-sp1 in topographical mode.
² Throughout this paper we use a system with an i.i.d. Rayleigh fading
channel, perfect channel knowledge and SQRD [10]. The BICM transmission
is set up with a convolutional channel code (rate 1/2, generator polynomials
[133o, 171o], constraint length 7) decoded by a max-log BCJR channel
decoder with perfect termination knowledge and an S-random interleaver
corresponding to 512 information bits. The SNR is defined as $\mathrm{SNR} = M_T E_s/N_0$,
with $E_s = E[|s|^2]$, $s \in \mathcal{O}$. $P[s_i]$ is approximated by equation (9). The VLSI
architecture internally operates on normalized metrics $M_{norm.} = N_0 M$ to
avoid division by $N_0$; normalized clipping levels are given by $N_0 L^E_{max}$.
[Figure 3 (plot, graphic omitted): x-axis: minimum SNR = $M_T E_s/N_0$ [dB] for 1 % FER, from 11 dB to 16 dB; y-axes: throughput $\Theta$ [$10^6$ information bits/s] and cumulative $E[N_{en}]$; curves: $E[N_{en}]$ for SE order (Matlab) with 1, 2 and 4 iterations, $E[N_{en}]$ for the SISO ASIC with 1, 2 and 4 iterations, and least-effort $\Theta$ for the SISO ASIC; the switching points ①, ② and ③ separate the regions of 1, 2, 3 and 4 iterations.]
Fig. 3. Cumulative $E[N_{en}]$ and iterative least-effort throughput $\Theta$ over
minimum SNR for 1 % FER for the 4×4 16-QAM architecture. Numbers annotated
to cumulative $E[N_{en}]$ curves are normalized clipping levels $N_0 L^E_{max}$. As in
[5], one iteration is defined as one use of the SISO MIMO demapper and the
SISO channel decoder (1st iteration corresponds to soft-output-only SD).
[Figure 4 (plot, graphic omitted): area [kGE] over critical path delay [ns] for QPSK, 16 QAM and 64 QAM with 2x2, 4x4 and 8x8 antennas, for the SISO and the soft-output-only configurations; reference point: soft-output-only ASIC by Studer et al. [6], 4x4 antennas, 16 QAM (scaled to 90 nm).]
Fig. 4. Parametrization design space of the proposed STS SD architecture.
Area is measured in gate equivalents (GEs). One GE corresponds to the area
of a two-input drive-one NAND gate.
Figure 4 shows the synthesis results for representative
parameter sets. The results for the soft-output-only case are
comparable to the implementation published in [6]. Since the
two base architectures are similar, they are close in terms of
area. The timing differs, mainly for two reasons. First, Figure 4
shows pre-layout synthesis results for a 90 nm technology
whereas those in [6] are post-layout results for a 250 nm
technology scaled to 90 nm by $f_{90} \approx \frac{250}{90} f_{250}$. Second, the
architectures differ in their pipeline and enumeration schemes.
By enabling soft-input processing for the 4×4 16-QAM
reference, the area increases by 57 % from 61 kGates to
96 kGates, while the clock frequency degrades by 34 % from
379 MHz to 250 MHz. We can conclude that the additional
cost for soft-input is affordable at the prospect of working at
lower SNR regimes with iterative systems.
³ Word lengths [integer.fractional] for 4×4 16 QAM: $\tilde{y}_i$ [6.7], $R_{i,j}$ [4.7],
$L^A_{i,b}$ [9.5], $L^E_{i,b}$ [9.5], $M_{\{C,A,P\}}$ [9.6]. A QAM-order increase of factor 4
requires one more integer bit for $\tilde{y}_i$ per real/imaginary part and two more
integer bits for $M_{\{C,A,P\}}$, $L^A_{i,b}$ and $L^E_{i,b}$. Doubling $M_T$ requires one more
integer bit for $M_{\{C,A,P\}}$, $L^A_{i,b}$ and $L^E_{i,b}$.
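The two percentages follow directly from the synthesis figures quoted above; a short check, using only those four numbers:

```python
def relative_change(before, after):
    """Relative change of a quantity with respect to its original value."""
    return (after - before) / before


area = relative_change(61.0, 96.0)      # kGates, soft-output-only vs. SISO
freq = relative_change(379.0, 250.0)    # MHz, soft-output-only vs. SISO
print(f"area: {area:+.1%}, clock frequency: {freq:+.1%}")
```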
The proposed architecture scales almost linearly with $M_T$
in terms of area. The critical path degrades by less than
10 % when doubling $M_T$. When increasing the QAM order
by a factor of 4 in the soft-input case, the area is less than
doubled while the frequency degrades by no more than 20 % to 25 %,
despite the enumeration being significantly affected.
VI. CONCLUSION
To the best of our knowledge, we introduced the first SISO STS
SD architecture, enabling iterative STS SD-based receivers.
The parametrized architecture offers very good scalability over
MT and the QAM order. The approximate hybrid-enumeration
method enables the implementation of iterative STS-based
MIMO receivers, although high data-rate communication sys-
tems may require multiple parallel SD instances to meet the
throughput constraints. We believe that the algorithms and
hardware-design principles presented in this paper are suitable
for most kinds of SD architectures. Our future development
will focus on further enhancements of the architecture, based
for instance on the ideas proposed in [6].
VII. ACKNOWLEDGEMENT
The authors would like to thank Chun-Hao Liao, I-Wei
Lai, Martin Senst, David Kammler, Andreas Minwegen, Uwe
Deidersen, Konstantinos Nikitopoulos, Dan Zhang, Jeronimo
Castrillon, Torsten Kempf, all reviewers and the editor for their
valuable feedback and support.
REFERENCES
[1] B. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-
antenna channel,” IEEE Trans. Commun., vol. 51, no. 3, pp. 389–399,
March 2003.
[2] S. Chen and T. Zhang, “Low power soft-output signal detector design
for wireless MIMO communication systems,” in ISLPED ’07: Proc. of
the 2007 international symposium on low power electronics and design.
New York, NY, USA: ACM, August 2007, pp. 232–237.
[3] M. Li et al., “Selective spanning with fast enumeration: A near
maximum-likelihood MIMO detector designed for parallel pro-
grammable baseband architectures,” in Proc. IEEE International Con-
ference on Communications ICC ’08, May 2008, pp. 737–741.
[4] S. Laraway and B. Farhang-Boroujeny, “Implementation of a Markov
chain Monte Carlo based multiuser/MIMO detector,” IEEE Trans. Circuits
Syst. I, vol. 56, no. 1, pp. 246–255, January 2009.
[5] C. Studer and H. Bölcskei, “Soft-input soft-output single tree-search
sphere decoding,” June 2009. http://arxiv.org/abs/0906.0840
[6] C. Studer, A. Burg, and H. Bölcskei, “Soft-output sphere decoding:
algorithms and VLSI implementation,” IEEE J. Sel. Areas Commun.,
vol. 26, no. 2, pp. 290–300, February 2008.
[7] B. Mennenga and G. Fettweis, “Search sequence determination for tree
search based detection algorithms,” in Proc. IEEE Sarnoff Symposium,
April 2009, pp. 1–6.
[8] C.-H. Liao et al., “Combining orthogonalized partial metrics: Efficient
enumeration for soft-input sphere decoder,” in Proc. IEEE 20th Inter-
national Symposium on Personal, Indoor and Mobile Radio Communi-
cations, September 2009.
[9] C. P. Schnorr and M. Euchner, “Lattice basis reduction: improved
practical algorithms and solving subset sum problems,” Math. Program.,
vol. 66, no. 2, pp. 181–199, August 1994.
[10] D. Wübben et al., “Efficient algorithm for decoding layered space-time
codes,” Electronics Letters, vol. 37, no. 22, pp. 1348–1350, October
2001.
[11] C. Hess et al., “Reduced-complexity MIMO detector with close-to
ML error rate performance,” in Proc. of the 17th ACM Great Lakes
Symposium on VLSI (GLSVLSI), March 2007, pp. 200–203.
